Abstract

The output of a discrete Markov source is to be encoded instantaneously by a variable-rate encoder and decoded 
by a finite-state decoder. Our performance measure is a linear combination of the distortion and the instantaneous 
rate. Structure theorems pertaining to the encoder and next-state functions are derived for every given finite-state
decoder, which can have access to side information.

I. Introduction 

We consider the following source coding problem. Symbols produced by a discrete Markov source are 
to be encoded, transmitted noiselessly and reproduced by a decoder which can have causal access to side 
information (SI) correlated to the source. Operation is in real time, that is, the encoding of each symbol 
and its reproduction by the decoder must be performed without any delay and the distortion measure does 
not tolerate delays. 

The decoder is assumed to be a finite-state machine with a fixed number of states. With no SI, the
scenario where the encoder is of fixed rate was investigated by Witsenhausen [1]. It was shown that for a
given decoder, in order to minimize the distortion at each stage for a Markov source of order k, an optimal
encoder can be found among those for which the encoding function depends on the k last source symbols
and the decoder's state (in contrast to the general case where it is a function of all past source symbols).
Walrand and Varaiya [2] extended this finding to a joint source-channel setup with noiseless feedback.
Teneketzis [3] used ideas from both [1] and [2] and considered the joint source-channel setup for a given
finite-state decoder without feedback. A causal variant of the Wyner-Ziv problem [4] was also considered
by Teneketzis [3]. It is shown in [3] that the optimal (fixed-rate) encoder for this case is a function of the
current source symbol and the probability mass function of the decoder's state for the symbols sent so 
far. Borkar, Mitter and Tatikonda [5] derived structure theorems of a similar spirit when the cost function 
is a linear combination (Lagrangian) of the conditional entropy of the reproduction sequence and the 
mean square error of the reproduction. The scenario where the encoder is also a finite state machine was 
considered by Gaarder and Slepian in Q. In some cases, the minimization of the distortion (or cost) 
can be cast as a stochastic control problem. In this case, tools developed for Markov decision processes 
are employed to either solve the optimization problem or to get insights on the structure of the optimal
solution. Examples of this technique include 1131, 115L 1171, ||8"1. 

When the time horizon and alphabets are finite, there is a finite number of possible deterministic 
encoding, decoding and memory update rules. In principle, a brute force search would yield the optimal 
choice. However, since the number of possibilities increases doubly exponentially in the duration of the 
communication and exponentially in the alphabet size, it is not tractable even for very short time horizons.
Recently, using the results of Q, Mahajan and Teneketzis flU proposed a search frame that is linear in 
the communication duration and doubly exponential in the alphabet size. 

Real time codes are a subclass of causal codes, as defined by Neuhoff and Gilbert [10]. In [10], entropy
coding is used on the whole sequence of reproduction symbols, introducing arbitrarily long delays. In the
real time case, entropy coding has to be instantaneous, symbol-by-symbol (possibly taking into account
past transmitted symbols). It was shown in [10] that for a discrete memoryless source (DMS), the optimal
causal encoder consists of time-sharing between no more than two memoryless encoders. Weissman and
Merhav [11] extended [10] to the case where SI is also available at the decoder, encoder or both. Error
exponents for real time coding with finite memory for a DMS were derived in [12].

This work extends [1] in several directions: The first is extending the result of [1] from fixed-rate
coding to variable-rate coding, where accordingly, the cost function is redefined so as to incorporate both
the expected distortion and the expected coding rate. Secondly, we allow the decoder access to causal
side information. Unlike [1] and [3], we do not a-priori restrict the encoders to be deterministic and thus
the encoders can be any stochastic function of all causally available data. While in [1] and [3], it is
quite clear that deterministic encoders are a-priori optimal, it is not immediately clear in our case, as
we discuss in the sequel. We show that structure theorems, in the same spirit as those of Witsenhausen
[1] and Teneketzis [3], continue to hold in this setting as well. Moreover, the structure can be simplified
when the decoder has infinite memory. Finally, we upper bound the loss incurred by using a suboptimal 
next-state function which uses a "sliding-window" over the past decoder inputs. We refer to such memory 
update functions as Markov memory update functions. The upper bound is given in terms of the original 
state alphabet and the window length. The suboptimal system that uses Markov memory update functions 
is analytically more tractable and its optimization is easier since in order to find the best sub-optimal 
system, effectively, as discussed in the sequel, only the encoders need to be optimized. 

In contrast to [1] and [3], where fixed-rate coding was considered, and hence the performance measure
was just the expected distortion, here, since we allow variable-rate coding, our cost function incorporates
both rate and distortion. This is done by defining our cost function in terms of the Lagrangian

(distortion) + $\lambda$ · (code length),

where $\lambda > 0$ is a fixed parameter that controls the tradeoff between rate and distortion. In [1], the proof of
the structure theorem relied on two lemmas. The proofs of the extensions of those lemmas to our case are 
more involved than the proofs of their respective original versions in [1]. To intuitively see why, remember
that the proof of the lemmas in [1] relied on the fact that for every decoder state, source symbol and
a given decoder, since there is a finite number of possible encoder outputs (governed by the fixed rate),
we could choose the one minimizing the distortion. However, in our case, such a choice might entail a
large expected coding rate, and although it minimizes the distortion, it will not minimize the overall cost
function (especially for large $\lambda$). Furthermore, unlike the case of [1], in our setting, the cost in future
stages depends non-linearly on the choices of earlier encoders and, in contrast to [1] and [3], there is no
reason, as we discuss in the sequel, to a-priori assume that deterministic encoders are optimal.

The remainder of the paper is organized as follows: In Section II, we give the formal setting and
notation used throughout the paper. In Section III, we start with the simpler setting without SI. Structure
theorems regarding the encoder are derived for both the finite and infinite memory models. In Section
IV, we upper bound the loss incurred when Markov memory functions are used instead of the optimal
next-state functions. In Section V, we extend the setting of Section III by allowing the decoder access to
SI. We begin each section by stating and discussing its main result. Finally, we conclude this work in
Section VI.



II. Preliminaries 

We begin with notation conventions. Capital letters represent scalar random variables (RVs), specific
realizations of them are denoted by the corresponding lower case letters, and their alphabets by calligraphic
letters. For $i < j$ ($i, j$ positive integers), $x_i^j$ will denote the vector $(x_i, \ldots, x_j)$, where for $i = 1$
the subscript will be omitted. $P_X(\cdot)$ will denote a probability measure over $\mathcal{X}$. When there is no room
for ambiguity, we will use $P(x)$ instead of $P_X(x)$. $\mathbb{1}\{A\}$ will denote the indicator of the event $A$.

We consider a Markov source producing a random sequence $X_1, X_2, \ldots, X_T$, $X_t \in \mathcal{X}$, $t = 1, 2, \ldots, T$.
The cardinality of $\mathcal{X}$, as well as those of other alphabets in the sequel, is finite. The probability mass
function of $X_1$, $P(x_1)$, and the transition probabilities, denoted by $P(x_t|x_{t-1})$, $t = 2, 3, \ldots, T$, are known.

Let $\mathcal{Y}$ denote the index set $\{1, 2, \ldots, M\}$ for some finite $M$. A variable-length stochastic encoder
is a sequence of functions $\{f_t\}_{t=1}^{T}$. At stage $t$, a stochastic encoder uses all the causally available data,
$(X^t, Y^{t-1})$, to choose a probability measure over $\mathcal{Y}$ from which $Y_t$ is drawn. After drawing $Y_t$, the encoder
noiselessly transmits an entropy-coded codeword of $Y_t$. A deterministic encoder is a stochastic encoder
which draws a specific $Y_t \in \mathcal{Y}$ with probability 1 (i.e., $Y_t$ is a deterministic function of $(X^t, Y^{t-1})$). Unlike
the fixed rate regime in [1], [3], where $\log_2 |\mathcal{Y}|$ (rounded up) was the rate of the code at stage $t$, here the
subset of $\mathcal{Y}$ used at each stage, along with the length of the binary representation of $Y_t$, will be subject
to optimization.

The encoder structure is not confined a-priori, and at each time instant $t$, $Y_t$ may be given by
an arbitrary (possibly stochastic) function of $(X^t, Y^{t-1})$, as described above. The decoder, however, is
assumed, similarly as in [1] and [3], to be a finite-memory device, defined as follows: At each stage $t$,
the decoder updates its current state (or memory) and outputs a reproduction symbol $\hat{X}_t$. We assume that
the decoder state, $Z_t$, is updated by

\[
Z_1 = r_1(Y_1), \qquad Z_t = r_t(Y_t, Z_{t-1}), \quad t = 2, 3, \ldots, T. \tag{1}
\]

Since the transmission is noiseless, $Z_t$ can be tracked by the encoder. Note that this model also includes
infinite memory, i.e., $Z_t = Y^t$. The reproduction symbols are produced by a sequence of functions $\{g_t\}$,
$g_t : \mathcal{Y} \times \mathcal{Z} \to \hat{\mathcal{X}}$, as follows:

\[
\hat{X}_1 = g_1(Y_1), \qquad \hat{X}_t = g_t(Y_t, Z_{t-1}), \quad t = 2, 3, \ldots, T. \tag{2}
\]
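The recursions (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `run_decoder` and the callable interface (`r1`, `g1` for the first stage, `r`, `g` for later stages) are hypothetical and are not taken from the paper.

```python
from typing import Callable, Hashable, List, Sequence


def run_decoder(
    ys: Sequence[Hashable],
    r1: Callable,
    g1: Callable,
    r: Callable,
    g: Callable,
) -> List:
    """Run a finite-state decoder on the received indices ys.

    r1(y), g1(y) handle stage 1; r(t, y, z_prev) and g(t, y, z_prev)
    handle stages t >= 2. All names here are illustrative assumptions.
    """
    if not ys:
        return []
    outputs = [g1(ys[0])]            # x_hat_1 = g_1(y_1)
    z = r1(ys[0])                    # z_1 = r_1(y_1)
    for t, y in enumerate(ys[1:], start=2):
        outputs.append(g(t, y, z))   # x_hat_t = g_t(y_t, z_{t-1})
        z = r(t, y, z)               # z_t = r_t(y_t, z_{t-1})
    return outputs
```

For instance, a toy decoder whose state is the last received index and whose reproduction is the modulo-2 sum of the current index and the state could be run as `run_decoder([1, 0, 1], r1=lambda y: y, g1=lambda y: y, r=lambda t, y, z: y, g=lambda t, y, z: (y + z) % 2)`.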



Since at the beginning of stage $t$, $Z_{t-1}$ is known to both encoder and decoder, the entropy coder at
every stage needs to encode the random variable $Y_t$ given $Z_{t-1} = z_{t-1}$. We define $\mathcal{A}$ to be the set of all
instantaneously uniquely decodable codes for $\mathcal{Y}$, i.e., all possible length functions $l : \mathcal{Y} \to \mathbb{Z}^+ \cup \{\infty\}$ that
satisfy Kraft's inequality:

\[
\mathcal{A} \triangleq \left\{ l(\cdot) : \sum_{y \in \mathcal{Y}} 2^{-l(y)} \le 1 \right\}. \tag{3}
\]
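As a small illustration of (3), the following sketch checks whether a list of codeword lengths satisfies Kraft's inequality, with an infinite length contributing zero to the sum. The function names `kraft_sum` and `satisfies_kraft` and the numerical tolerance are assumptions made for this example.

```python
import math
from typing import Iterable


def kraft_sum(lengths: Iterable[float]) -> float:
    """Return the sum of 2^(-l(y)); an infinite length contributes 0."""
    return sum(0.0 if math.isinf(l) else 2.0 ** (-l) for l in lengths)


def satisfies_kraft(lengths: Iterable[float]) -> bool:
    """True if the length function is admissible in the sense of (3)."""
    return kraft_sum(lengths) <= 1.0 + 1e-12
```

For example, the lengths `[1, 2, 2]` meet the inequality with equality, `[1, 1, 2]` violate it, and `[1, 1, math.inf]` is admissible because the third symbol is effectively excluded from the code.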



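Note that we allow infinite-length codewords. We will return to this technical issue after properly defining
the cost function. The average codeword length at stage $t$, for a specific decoder state $z_{t-1}$, will be given
by:

\[
L_{Y_t|Z_{t-1}}(z_{t-1}) \triangleq
\begin{cases}
0 & \text{if } \max_{y_t \in \mathcal{Y}} P(y_t|z_{t-1}) = 1, \\
\min_{l(\cdot) \in \mathcal{A}} \sum_{y_t \in \mathcal{Y}} P(y_t|z_{t-1})\, l(y_t) & \text{otherwise,}
\end{cases} \tag{4}
\]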



i.e., if given $Z_{t-1} = z_{t-1}$, $Y_t$ is deterministically known, there is no need to transmit any information;
otherwise $L_{Y_t|Z_{t-1}}(z_{t-1})$ is obtained by designing a Huffman code for the probability distribution
$P_{Y_t|Z_{t-1}}(\cdot|z_{t-1})$. Note that for given encoders and state update functions, $L_{Y_t|Z_{t-1}}(z_{t-1})$ is a function of
$z_{t-1}$ only. Also, $L_{Y_t|Z_{t-1}}(z_{t-1})$ is discontinuous in the distribution $P_{Y_t|Z_{t-1}}(\cdot|z_{t-1})$, since if, given
$Z_{t-1} = z_{t-1}$, $Y_t$ is not deterministically known, then $L_{Y_t|Z_{t-1}}(z_{t-1}) \ge 1$.
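The per-state codeword length in (4) can be computed with a standard Huffman construction. The sketch below is an illustration only; the function name `stage_codeword_length` is hypothetical, and zero-probability symbols are dropped before coding, which corresponds to assigning them infinite-length codewords.

```python
import heapq
from typing import Sequence


def stage_codeword_length(p: Sequence[float]) -> float:
    """Expected codeword length for a conditional pmf P(. | z_{t-1}).

    Returns 0 when a single symbol carries all the probability mass;
    otherwise returns the expected length of a Huffman code built on
    the support of p.
    """
    support = [q for q in p if q > 0.0]
    if len(support) <= 1:
        return 0.0
    heap = list(support)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b               # each merge adds one bit to every leaf below it
        heapq.heappush(heap, a + b)
    return total
```

The discontinuity noted above is visible here: a point mass has length 0, while a pmf that puts even a tiny probability on a second symbol already costs one full bit.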

The average codeword length of stage $t$, denoted $L_{Y_t|Z_{t-1}}$, is defined as $E[L_{Y_t|Z_{t-1}}(Z_{t-1})]$, where the
expectation is with respect to $Z_{t-1}$. Our system model is depicted in Figure 1.
Fig. 1: System model

We are given a sequence of distortion measures $\{\rho_t\}_{t=1}^{T}$, $\rho_t : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}^+$. At each stage, the cost
function is a linear combination of the average distortion and codeword length $L_{Y_t|Z_{t-1}}$, i.e.,

\[
J_t \triangleq E\left[ \rho_t(X_t, \hat{X}_t) + \lambda L_{Y_t|Z_{t-1}}(Z_{t-1}) \right], \tag{5}
\]

where $\lambda > 0$ is a fixed parameter that controls the tradeoff between rate and distortion. Our goal is to
minimize the average cost

\[
J \triangleq \sum_{t=1}^{T} J_t. \tag{6}
\]

A sequence of encoders, $f_1, \ldots, f_T$, is said to be optimal if for a given sequence of decoders and
memory update functions, $f_1, \ldots, f_T$ attains $\inf J$, where the infimum is over the set of all sequences of
stochastic encoders, which are functions of all causally available data.

A stage-$t$ encoder is said to be optimal if, given the future stages' encoders and decoders, it attains
$\inf \sum_{i=t}^{T} J_i$, where the infimum is over the set of stochastic stage-$t$ encoders (which are functions of
$(X^t, Y^{t-1})$).

Note that for large enough $\lambda$, for some $z_{t-1}$, the optimal encoders might use only a small subset of
$\mathcal{Y}$ (thus attaining higher distortion but smaller overall cost). Technically, this means that there will be a
subset $\mathcal{B} \subset \mathcal{Y}$ such that $P(y_t|z_{t-1}) = 0$ if $y_t \in \mathcal{B}^c$. We therefore need $\mathcal{A}$ to contain good codes
for subsets of $\mathcal{Y}$. By allowing infinite-length codewords, we make sure that $\mathcal{A}$ contains codes which are
uniquely decodable for all subsets of $\mathcal{Y}$ (and satisfy Kraft's inequality for the alphabet $\mathcal{Y}$). Needless to say,
with this definition, a code for $\mathcal{B} \subset \mathcal{Y}$ will be used iff $P(y_t|z_{t-1}) = 0$ for all $y_t \in \mathcal{B}^c$, where we use
$0 \cdot \infty = 0$.



III. Structure Theorems - No Side Information 

A. Main results 

We start by briefly stating and discussing the main contributions of this section. The proofs of the 
following theorems are found in the following subsections. 

The first contribution of this paper is the following theorem, which basically states that the results of
[1] continue to hold in this setting as well.

Theorem I. For a Markov source and any given sequence of memory update functions $\{r_t\}$, reproduction
functions $\{g_t\}$ and distortion measures $\{\rho_t\}$, there exists a sequence of deterministic encoders
$Y_t = f_t(X_t, Z_{t-1})$ which is optimal.

The addition of variable-rate coding and the larger class of encoders allowed compared to [1] make
the proof of this result considerably more involved than its counterpart in [1], as was discussed at the end
of Section I.

While Theorem I covers the infinite decoder memory ($Z_t = Y^t$) setting, in this case, when optimal
reproduction functions are used (see Subsection III-E), we have the following theorem, which refines Theorem
I for this case:

Theorem II. For a Markov source and any sequence of distortion measures $\{\rho_t\}$ and optimal infinite
memory decoders, there exists a sequence of deterministic encoders $Y_t = f_t(X_t, P_{X_t|Y^{t-1}}(\cdot|y^{t-1}))$ which
is optimal.

We will show that $P_{X_t|Y^{t-1}}(\cdot|y^{t-1})$ can be recursively updated. Theorem II is a refinement of Theorem
I since, in the setup of Theorem II, there is no need to store the whole history of encoder outputs, $Y^t$, as in
the statement of Theorem I; instead, $P_{X_t|Y^{t-1}}(\cdot|y^{t-1})$ is recursively updated (given that a probability
measure can be stored).

In the remainder of this section, we will prove Theorems I and II, starting with Theorem I. In order to
prove Theorem I we need a few supporting lemmas, as in [1]. In the following two subsections, we state
and prove the supporting lemmas and then prove Theorem I in Subsection III-D. Theorem II is proved
in Subsection III-E.



B. Two-stage lemma 

We start by analyzing a system with two stages only, where the first encoder is known. 

Lemma I. For any two-stage system ($T = 2$), there exists a deterministic second stage encoder
$Y_2 = f_2(X_2, Z_1)$, which is optimal.

Proof: Note that $f_1, g_1, g_2, r_1$ are fixed, and so, $J_1$ is unchanged by changing $f_2$. We need to show that a
second stage encoder that minimizes $J_2$ can be a deterministic function of $(X_2, Z_1)$. Denote the set of
stochastic encoders which are functions of $(X_1, X_2, Y_1)$ by $\{f^{S}_{X^2 Y_1}\}$. For every joint probability measure
over the quadruple $(X_1, X_2, Y_2, Z_1)$, $J_2$ is well defined and our objective is to find the optimal encoder
that attains:

\[
\inf_{\{f^{S}_{X^2 Y_1}\}} J_2 = \inf_{\{f^{S}_{X^2 Y_1}\}} E\left\{ \rho_2(X_2, g_2(Y_2, Z_1)) + \lambda L_{Y_2|Z_1}(Z_1) \right\}. \tag{7}
\]

Consider the random quintuple $(X_1, X_2, Y_1, Y_2, Z_1)$ which takes part in the expectation of (7). From the
structure of the system, we know that

\[
P(x_1, x_2, y_1, y_2, z_1) = P(x_1) P(x_2|x_1) P(y_1|x_1) P(y_2|x_1, x_2, y_1)\, \mathbb{1}\{r_1(y_1) = z_1\}, \tag{8}
\]

where we used the fact that $z_1$ is a deterministic function of $y_1$. Everything but the second stage encoder,
which directly affects $P(y_2|x_1, x_2, y_1)$, is fixed. Note that the optimization affects $L_{Y_2|Z_1}(Z_1)$ since

\[
L_{Y_2|Z_1}(z_1) = \min_{l(\cdot) \in \mathcal{A}} \sum_{x_2, y_2} P(x_2|z_1) P(y_2|x_2, z_1)\, l(y_2) \tag{9}
\]

and $P(y_2|x_2, z_1)$ depends on $P(y_2|x_1, x_2, y_1)$ as we will show shortly.

Let $\{f^{S}_{X_2 Z_1}\}$ denote the subset of stochastic encoders which are functions of $(X_2, Z_1)$. Also, let $\{f^{D}_{X_2 Z_1}\}$
denote the subset of deterministic encoders which are functions of $(X_2, Z_1)$. Since $Z_1$ is a function of
$Y_1$, $\{f^{D}_{X_2 Z_1}\} \subset \{f^{S}_{X_2 Z_1}\} \subset \{f^{S}_{X^2 Y_1}\}$. We prove Lemma I in two steps. First, we show that it is enough
to search in the (infinite) subset $\{f^{S}_{X_2 Z_1}\}$. In the second step, we show that among $\{f^{S}_{X_2 Z_1}\}$, the optimal
encoder is a member of $\{f^{D}_{X_2 Z_1}\}$.
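Step 1: We rewrite (7) as follows:

\[
\begin{aligned}
\inf_{\{f^{S}_{X^2 Y_1}\}} J_2
&= \inf_{\{f^{S}_{X^2 Y_1}\}} \sum_{x_2, y_2, z_1} P(x_2, y_2, z_1) \left[ \rho_2(x_2, g_2(y_2, z_1)) + \lambda L_{Y_2|Z_1}(z_1) \right] \\
&= \inf_{\{f^{S}_{X^2 Y_1}\}} \sum_{x_2, y_2, z_1} P(x_2, y_2, z_1) \left[ \rho_2(x_2, g_2(y_2, z_1)) + \lambda \min_{l(\cdot) \in \mathcal{A}} \sum_{x_2', y_2'} P(x_2'|z_1) P(y_2'|x_2', z_1)\, l(y_2') \right].
\end{aligned} \tag{10}
\]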

Now, given that the first stage encoder and decoder are known, $P(x_2, z_1)$ is well defined since

\[
\begin{aligned}
P(x_2, z_1) &= \sum_{x_1, y_1} P(x_1, x_2, y_1, z_1) \\
&= \sum_{x_1, y_1} P(x_1) P(x_2|x_1) P(y_1|x_1)\, \mathbb{1}\{r_1(y_1) = z_1\},
\end{aligned} \tag{11}
\]

where $P(x_1)$, $P(x_2|x_1)$ and $\mathbb{1}\{r_1(y_1) = z_1\}$ are determined by the known source and first stage next-state
function, and $P(y_1|x_1)$ is directly determined by the first stage encoder. Also, by the Bayes rule, we have, for
any second stage encoder:

\[
P(x_2, y_2, z_1) = P(y_2|x_2, z_1) P(x_2, z_1). \tag{12}
\]



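Therefore,

\[
\inf_{\{f^{S}_{X^2 Y_1}\}} J_2 = \inf_{\{f^{S}_{X^2 Y_1}\}} \sum_{x_2, y_2, z_1} P(y_2|x_2, z_1) P(x_2, z_1) \left[ \rho_2(x_2, g_2(y_2, z_1)) + \lambda \min_{l(\cdot) \in \mathcal{A}} \sum_{x_2', y_2'} P(x_2'|z_1) P(y_2'|x_2', z_1)\, l(y_2') \right]. \tag{13}
\]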



The only term that is affected by the optimization is $P(y_2|x_2, z_1)$. Observe that by (8), we have

\[
\begin{aligned}
P(y_2|x_2, z_1) &= \frac{\sum_{x_1, y_1} P(x_1, x_2, y_1, y_2, z_1)}{P(x_2, z_1)} \\
&= \frac{\sum_{x_1, y_1} P(x_1) P(x_2|x_1) P(y_1|x_1) P(y_2|x_1, x_2, y_1)\, \mathbb{1}\{r_1(y_1) = z_1\}}{P(x_2, z_1)}.
\end{aligned} \tag{14}
\]
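The mapping in (14), from a general second-stage encoder $P(y_2|x_1, x_2, y_1)$ to the induced kernel $P(y_2|x_2, z_1)$, can be sketched as follows. This is an illustrative sketch only: the function name `induced_encoder`, the nested-list representation indexed by integer symbols, and the argument names are assumptions for this example.

```python
from itertools import product


def induced_encoder(p_x1, p_x2_given_x1, p_y1_given_x1, enc2, r1):
    """Marginalise a second-stage encoder over (x1, y1).

    p_x1[x1], p_x2_given_x1[x1][x2], p_y1_given_x1[x1][y1] describe the
    source and first-stage encoder; enc2[x1][x2][y1][y2] is the
    second-stage encoder; r1[y1] is the first-stage state update.
    Returns {(x2, z1): [P(y2 | x2, z1) for each y2]} for every pair
    (x2, z1) of positive probability.
    """
    n_x1 = len(p_x1)
    n_x2 = len(p_x2_given_x1[0])
    n_y1 = len(p_y1_given_x1[0])
    n_y2 = len(enc2[0][0][0])
    joint = {}  # (x2, z1) -> unnormalised P(x2, y2, z1) for each y2
    for x1, x2, y1 in product(range(n_x1), range(n_x2), range(n_y1)):
        w = p_x1[x1] * p_x2_given_x1[x1][x2] * p_y1_given_x1[x1][y1]
        if w == 0.0:
            continue
        row = joint.setdefault((x2, r1[y1]), [0.0] * n_y2)
        for y2 in range(n_y2):
            row[y2] += w * enc2[x1][x2][y1][y2]
    # Dividing by the row sum is dividing by P(x2, z1).
    return {key: [v / sum(row) for v in row] for key, row in joint.items()}
```

One way to read the result: a second-stage encoder that is deterministic in $(x_1, x_2, y_1)$ generally induces a stochastic kernel in $(x_2, z_1)$, since the decoder state may not retain $x_1$.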



From (13), (14), it is evident that the role of the second stage encoder in a two stage system is to select
$P_{Y_2|X_2,Z_1}(\cdot|x_2, z_1)$ for every $(x_2, z_1)$ so as to minimize the cost. To see this, note that every $f_2 \in \{f^{S}_{X^2 Y_1}\}$ is
mapped by (14) (through $P(y_2|x_1, x_2, y_1)$ for every $(x_1, x_2, y_1)$) to a point on the simplex of probability
measures on $\mathcal{Y}$ for every $(x_2, z_1)$. Namely, every $f_2 \in \{f^{S}_{X^2 Y_1}\}$ is mapped to $\tilde{f}_2 \in \{f^{S}_{X_2 Z_1}\}$ and the
optimization is affected only by $\tilde{f}_2$. If instead of using a specific $f_2$ we use the $\tilde{f}_2$ that results from it
through (14), the joint probability $P(x_2, y_2, z_1)$ will remain the same and therefore, also the second stage
cost. Also note that we cannot gain anything from optimizing only over $\{f^{S}_{X_2 Z_1}\}$ and not $\{f^{S}_{X^2 Y_1}\}$ since
$\{f^{S}_{X_2 Z_1}\}$ is completely covered by $\{f^{S}_{X^2 Y_1}\}$ through (14). Therefore, since the optimization over $\{f^{S}_{X^2 Y_1}\}$
is mapped to an optimization over $\{f^{S}_{X_2 Z_1}\}$, we have

\[
\inf_{\{f^{S}_{X^2 Y_1}\}} J_2 = \inf_{\{f^{S}_{X_2 Z_1}\}} J_2, \tag{15}
\]

which completes the first step of the proof.

Step 2: To complete the proof of the two-stage lemma, we need to show that it is enough to search in the
finite space of deterministic encoders which are functions of $(X_2, Z_1)$. Observe that the set of stochastic
encoders is a convex set. The extreme points of this set (the points that are not convex combinations
of other points) are deterministic encoders, namely, the set $\{f^{D}_{X_2 Z_1}\}$. To complete the proof, we use the
following lemma, proved in Appendix A.

Lemma II. The stage $t$ loss function is concave in $\{f^{S}_{X_t Z_{t-1}}\}$.

Using Lemma II, we conclude that since we minimize a concave function over a convex set, the
minimizer will be one of the extreme points of the set, i.e., a member of $\{f^{D}_{X_2 Z_1}\}$. We thus showed that

\[
\inf_{\{f^{S}_{X_2 Z_1}\}} J_2 = \min_{\{f^{D}_{X_2 Z_1}\}} J_2. \tag{16}
\]

This completes the proof of the two-stage lemma. Note that no assumptions on the statistics of the
source were made in the proof and therefore, the two-stage lemma holds for any source. ■
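The statement of the two-stage lemma can be checked numerically on toy instances. The sketch below is an illustration under stated assumptions, not the paper's procedure: it evaluates the second-stage cost for any stochastic encoder $P(y_2|x_2, z_1)$, enumerates all deterministic encoders by brute force, and provides a helper for sampling random stochastic encoders to compare against. All function names, the nested-list layout and the toy parameters in the usage note are hypothetical.

```python
import heapq
import itertools
import random


def huffman_len(p):
    """Expected Huffman length on the support of p; 0 for a point mass."""
    heap = [q for q in p if q > 0.0]
    if len(heap) <= 1:
        return 0.0
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total


def stage_cost(p_xz, enc, rho, g, lam):
    """Second-stage cost for the joint pmf p_xz[x][z] and encoder enc[x][z][y].

    rho[x][x_hat] is the distortion and g[y][z] the reproduction symbol.
    """
    n_x, n_z, n_y = len(p_xz), len(p_xz[0]), len(enc[0][0])
    cost = 0.0
    for z in range(n_z):
        p_z = sum(p_xz[x][z] for x in range(n_x))
        if p_z == 0.0:
            continue
        # Distortion term: linear in the encoder.
        for x in range(n_x):
            for y in range(n_y):
                cost += p_xz[x][z] * enc[x][z][y] * rho[x][g[y][z]]
        # Rate term: Huffman length of P(y | z), weighted by P(z).
        p_y = [sum(p_xz[x][z] * enc[x][z][y] for x in range(n_x)) / p_z
               for y in range(n_y)]
        cost += lam * p_z * huffman_len(p_y)
    return cost


def best_deterministic(p_xz, rho, g, lam, n_y):
    """Brute-force minimum of the cost over deterministic maps (x, z) -> y."""
    n_x, n_z = len(p_xz), len(p_xz[0])
    best = float("inf")
    for choice in itertools.product(range(n_y), repeat=n_x * n_z):
        enc = [[[1.0 if choice[x * n_z + z] == y else 0.0 for y in range(n_y)]
                for z in range(n_z)] for x in range(n_x)]
        best = min(best, stage_cost(p_xz, enc, rho, g, lam))
    return best


def random_encoder(n_x, n_z, n_y, rng):
    """Sample a stochastic encoder with strictly positive entries."""
    enc = []
    for _ in range(n_x):
        row = []
        for _ in range(n_z):
            w = [rng.random() + 1e-9 for _ in range(n_y)]
            s = sum(w)
            row.append([v / s for v in w])
        enc.append(row)
    return enc
```

With a binary source, Hamming distortion, `g[y][z] = y` and `lam = 0.3`, comparing `best_deterministic(...)` against `stage_cost(...)` for encoders drawn from `random_encoder(2, 2, 2, random.Random(0))` gives a quick sanity check that no sampled stochastic encoder beats the best deterministic one.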

Discussion: 

1. Observe that the actual optimal encoding function for each $(x_2, z_1)$ depends on the encoder of the first
stage through $P(x_2, z_1)$ (which also governs $P(x_2|z_1)$), as seen from (13). This is true in general and
not only in a two-stage system. The joint distribution $P_{X_t,Z_{t-1}}(\cdot, \cdot)$ can be thought of as the state of the
system, governed by the choices of previous encoders (note however, that this state is static in the sense
that it is not influenced by the actual realization of the source sequence). Therefore, the role of the stage
$t$ encoder, besides greedily minimizing the stage $t$ cost (given the state $P_{X_t,Z_{t-1}}(\cdot, \cdot)$), is to control the
future states so that they will allow minimal costs in future stages. This is true for all but the last encoder,
which does not affect future cost, as seen for the second stage encoder in a two stage system. We will
come back to this issue in Subsection III-E when we deal with infinite memory decoders and apply tools
of stochastic control.

2. It is not surprising that the optimal second stage cost is attained by a deterministic encoder. Since the
second stage is the last stage, the last encoder does not affect future costs and therefore, instead of using
a convex combination of deterministic encoders (i.e., a stochastic encoder), we use only the one with the
best performance. However, in a system with more stages, it is not immediately clear that deterministic
encoders in intermediate stages are optimal. In fact, this is also true for the first stage of a two stage
system. We saw that the first stage affects the second stage cost through $P(x_2, z_1)$. Specifically, it affects
the second stage cost through $P(x_2|z_1)$ (as seen in (10)), which is non-linear in the first stage encoder
$P(y_1|x_1)$ since

\[
P(x_2|z_1) = \frac{\sum_{x_1, y_1} P(x_1, x_2) P(y_1|x_1)\, \mathbb{1}\{r_1(y_1) = z_1\}}{\sum_{x_1, x_2, y_1} P(x_1, x_2) P(y_1|x_1)\, \mathbb{1}\{r_1(y_1) = z_1\}}. \tag{17}
\]

If the first-stage encoder is deterministic, there is only a finite number of possible $P_{X_2|Z_1}(\cdot|\cdot)$. Assume
that we use a stochastic first-stage encoder, $f_1^*$. Although by Lemma II, $f_1^*$ is sub-optimal for $J_1$, can it
allow us to reach a $P_{X_2|Z_1}(\cdot|\cdot)$, unreachable by deterministic encoders, that will be favorable in terms of
$J_2$ and yield a lower overall cost? We show in the sequel that the answer is negative and that the optimal
first-stage encoder is deterministic as well. We will show that the stage $t$ cost is a concave functional in
the choices of the previous stages' encoders. The proof of the last statement is much more involved than
the proof of Lemma II and it is discussed in the next subsection. In [1], [3], the stage $t$ distortion is linear
in the choice of the encoders at all previous stages (since the expectation is linear and the non-linear
element of the codeword length was not present). Therefore, there was no loss of optimality in a-priori
confining the encoders to be deterministic. We further address this issue in the following subsection, which
deals with a more complex system.

Corollary I. In any $T$-stage system ($T > 2$) there exists a deterministic last stage encoder
$Y_T = f_T(X_T, Z_{T-1})$, which is optimal.

Proof: Let $\tilde{X}_1 = (X_1, X_2, \ldots, X_{T-1})$, $\tilde{X}_2 = X_T$, $\tilde{Z}_1 = Z_{T-1}$, where $\tilde{Z}_1$ is calculated recursively according
to the encoding functions that operate on $\tilde{X}_1$ and the resulting $Y_1, \ldots, Y_{T-1}$. We now apply the two-stage
lemma to this system to conclude that the last stage encoder is a deterministic function of $(X_T, Z_{T-1})$. ■

C. Three-stage lemma 

Lemma III. In a three-stage system ($T = 3$) with a Markov source, if the third-stage encoder is a
deterministic function of $(X_3, Z_2)$, then there exists a deterministic second stage encoder $Y_2 = f_2(X_2, Z_1)$,
which is optimal.

Proof of Lemma III: We define, as in Subsection III-B, $\{f^{S}_{X^2 Y_1}\}$ to be the set of all possible stochastic
second-stage encoders. Let $\{f^{S}_{X_2 Z_1}\} \subset \{f^{S}_{X^2 Y_1}\}$ be the set that contains all stochastic second stage encoders
that are functions of $(X_2, Z_1)$ and finally, let $\{f^{D}_{X_2 Z_1}\} \subset \{f^{S}_{X_2 Z_1}\}$ denote the set of deterministic encoders
which are functions of $(X_2, Z_1)$. Since the first stage is fixed, $J_1$ is unaffected by changing the second
stage encoder. Our goal is to jointly optimize $(J_2 + J_3)$ with respect to the second stage encoder and show
that

\[
\inf_{\{f^{S}_{X^2 Y_1}\}} (J_2 + J_3) = \min_{\{f^{D}_{X_2 Z_1}\}} (J_2 + J_3). \tag{18}
\]

Since the third stage encoder is known, the expected third stage cost for any second stage encoder is
given by

\[
\begin{aligned}
J_3 &= E\left\{ \rho_3(X_3, g_3(Y_3, Z_2)) + \lambda L_{Y_3|Z_2}(Z_2) \right\} \\
&= \sum_{x_3, y_3, z_2} P(x_3, z_2) P(y_3|x_3, z_2) \left[ \rho_3(x_3, g_3(y_3, z_2)) + \lambda L_{Y_3|Z_2}(z_2) \right] \\
&= \sum_{x_3, y_3, z_2} P(x_3, z_2)\, \mathbb{1}\{f_3(x_3, z_2) = y_3\}\, \rho_3(x_3, g_3(y_3, z_2)) \\
&\quad + \lambda \sum_{z_2} P(z_2) \min_{l(\cdot) \in \mathcal{A}} \sum_{x_3, y_3} \mathbb{1}\{f_3(x_3, z_2) = y_3\}\, P(x_3|z_2)\, l(y_3).
\end{aligned} \tag{19}
\]

The second-stage encoder affects the last expression through $P(x_3, z_2)$ (and thus also through $P(z_2)$ and
$P(x_3|z_2)$) since

\[
\begin{aligned}
P(x_3, z_2) &= \sum_{x_2, y_2, z_1} P(x_2, y_2, z_1, x_3, z_2) \\
&= \sum_{x_2, y_2, z_1} P(x_2, z_1) P(y_2|x_2, z_1) P(x_3|x_2)\, \mathbb{1}\{r_2(y_2, z_1) = z_2\},
\end{aligned} \tag{20}
\]



where $P(x_2, z_1)$ is the result of the first stage and we used the fact that the source is Markov and that $z_2$
is a deterministic function of $(y_2, z_1)$. Therefore, as we saw in Subsection III-B, the optimization affects
the third stage only through $P(y_2|x_2, z_1)$ for all $(x_2, y_2, z_1)$. We saw in (13) that the second stage cost
can be written as:

\[
J_2 = \sum_{x_2, y_2, z_1} P(y_2|x_2, z_1) P(x_2, z_1) \left[ \rho_2(x_2, g_2(y_2, z_1)) + \lambda \min_{l(\cdot) \in \mathcal{A}} \sum_{x_2', y_2'} P(x_2'|z_1) P(y_2'|x_2', z_1)\, l(y_2') \right], \tag{21}
\]



where $P(x_2, z_1)$ and thus $P(x_2|z_1)$ are the result of the first-stage encoder. We see that the optimization
in the l.h.s. of (18) affects both the second and third stage costs only through the conditional probabilities
$P(y_2|x_2, z_1)$, for all $(x_2, y_2, z_1)$. Repeating the arguments used in the proof of Lemma I, instead of using
a specific $f_2 \in \{f^{S}_{X^2 Y_1}\}$, we can use the $\tilde{f}_2 \in \{f^{S}_{X_2 Z_1}\}$ that results from it through (14), to draw $Y_2$. Since
$P(x_2, y_2, z_1)$ will remain the same,

\[
P(x_3, z_2) = \sum_{x_2, y_2, z_1} P(x_2, z_1) P(y_2|x_2, z_1) P(x_3|x_2)\, \mathbb{1}\{r_2(y_2, z_1) = z_2\} \tag{22}
\]

will also remain the same and $(J_2 + J_3)$ will not be affected by this step. We therefore have

\[
\inf_{\{f^{S}_{X^2 Y_1}\}} (J_2 + J_3) = \inf_{\{f^{S}_{X_2 Z_1}\}} (J_2 + J_3). \tag{23}
\]



As in the two stage lemma, we need to show that it is enough to search in the space of deterministic
encoders, $\{f^{D}_{X_2 Z_1}\}$. Here, we have to show that both the second stage cost and the third stage cost are
concave in $\{f^{S}_{X_2 Z_1}\}$. We know that the second stage cost is concave in $\{f^{S}_{X_2 Z_1}\}$ from Lemma II. The
following lemma asserts that the third stage cost is concave in $\{f^{S}_{X_2 Z_1}\}$.

Lemma IV. The third stage cost, $J_3$, is a concave functional of $\{f^{S}_{X_2 Z_1}\}$.

The proof of Lemma IV is much more involved than the proof of Lemma II and can be found in Appendix B.

Using Lemma IV, we conclude that $(J_2 + J_3)$, which is the sum of two concave functionals, is concave
in $\{f^{S}_{X_2 Z_1}\}$. Therefore, the minimizer will be one of the extreme points of the convex set $\{f^{S}_{X_2 Z_1}\}$,
namely, a member of $\{f^{D}_{X_2 Z_1}\}$. We showed that

\[
\inf_{\{f^{S}_{X_2 Z_1}\}} (J_2 + J_3) = \min_{\{f^{D}_{X_2 Z_1}\}} (J_2 + J_3). \tag{24}
\]

Using (23), we arrive at (18), which completes the proof of Lemma III.
D. Proof of Theorem I

With the two- and three-stage lemmas, we can prove Theorem I by using the method of [1], used for
fixed rate encoding. Theorem I is proven by backward induction. First apply Corollary I to any system
to conclude that the optimal $f_T$ is a deterministic function of $(X_T, Z_{T-1})$. Now assume that the last $m$
encoders $f_{T-m+1}, \ldots, f_T$ are deterministic functions of $(X_{T-m+1}, Z_{T-m}), \ldots, (X_T, Z_{T-1})$, respectively. We
will show that the encoder at time $(T - m)$ also has this structure and continue backwards until $t = 2$. The
first encoder is trivially a function of $X_1$ and by Lemma IV (with $Z_0$ as a constant) it is also deterministic.
Let

\[
\begin{aligned}
\tilde{X}_1 &= (X_1, X_2, \ldots, X_{T-m-1}), \\
\tilde{Y}_1 &= (Y_1, Y_2, \ldots, Y_{T-m-1}), \\
\tilde{Z}_1 &= \tilde{r}_1(\tilde{Y}_1), \\
\tilde{X}_2 &= X_{T-m}, \\
\tilde{Y}_2 &= Y_{T-m}, \\
\tilde{Z}_2 &= r_{T-m}(\tilde{Y}_2, \tilde{Z}_1), \\
\tilde{X}_3 &= (X_{T-m+1}, X_{T-m+2}, \ldots, X_T), \\
\tilde{Y}_3 &= (Y_{T-m+1}, Y_{T-m+2}, \ldots, Y_T),
\end{aligned} \tag{25}
\]

where $\tilde{Z}_1$ is recursively calculated from $\tilde{Y}_1$ and it represents the state of the decoder after $(T - m - 1)$
stages. Using this new notation, the encoder that produces $\tilde{Y}_3$ is a deterministic function of $(\tilde{X}_3, \tilde{Z}_2)$
(since, by assumption, the last $m$ encoders have the desired structure). The source is Markov since $\tilde{X}_3$
is independent of $\tilde{X}_1$ given $\tilde{X}_2$ (since the original source is Markov). Now, by the three-stage lemma,
$\tilde{Y}_2 = Y_{T-m} = f_{T-m}(\tilde{X}_2, \tilde{Z}_1) = f_{T-m}(X_{T-m}, Z_{T-m-1})$. Thus, the induction step is proved. This completes
the proof of Theorem I. ■

Remark: Theorem I can be extended to a $k$-th order Markov source using Witsenhausen's method [1].
Namely, for a $k$-th order Markov source, define $\tilde{X}_1 = (X_1, X_2, \ldots, X_k)$, $\tilde{X}_2 = (X_2, X_3, \ldots, X_{k+1})$ and so
on. Now, $\{\tilde{X}_t\}$ is a Markov source. Using Theorem I we can conclude that the optimal encoder is a function
of the last $k$ source symbols and the state of the decoder.
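The block construction in the remark above amounts to sliding a window of length $k$ over the source sequence. A minimal sketch, in which the function name `to_block_source` is a hypothetical choice for this illustration:

```python
from typing import List, Sequence, Tuple


def to_block_source(xs: Sequence, k: int) -> List[Tuple]:
    """Return the overlapping k-tuples (x_t, ..., x_{t+k-1}), t = 1, 2, ..."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return [tuple(xs[t:t + k]) for t in range(len(xs) - k + 1)]
```

Consecutive tuples share $k - 1$ symbols, so the next tuple depends on the past only through the current one whenever the original sequence is Markov of order $k$.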

E. Infinite memory decoder - proof of Theorem II

In this section, we deal with the case where the decoder has infinite memory, i.e., $Z_t = Y^t$. The memory
update functions $\{r_t\}$ in this case only append the newly received index $Y_t$ to $Z_{t-1}$. Note that this
scenario is covered by Theorem I; however, in this case we can be more specific regarding the role of $Y^t$
at the encoder. While Theorem I was true for any decoding rule, Theorem II is true only for the optimal
reproduction function. We define the Bayes Envelope as

\[
B(P_{X_t|Y^t}) = \min_{\hat{x}_t} \sum_{x_t} P(x_t|y^t)\, \rho_t(x_t, \hat{x}_t). \tag{26}
\]

The minimizer of the last expression is called the Bayes-response and will be denoted by $\hat{X}_{Bayes}(P_{X_t|Y^t})$.
Clearly, $\hat{X}_{Bayes}(P_{X_t|Y^t})$ is a function of $P_{X_t|Y^t}(\cdot|y^t)$ and the cost function, $\rho_t$. The fact that the optimal
reproduction function is the Bayes-response was shown in many places, for example 0,(81 Lemma 3].
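For finite alphabets, the Bayes envelope in (26) and its minimizer can be found by direct enumeration. The following is a minimal sketch; the function name `bayes_response` and the list-of-lists representation of the distortion measure are assumptions of this example.

```python
def bayes_response(posterior, rho):
    """Return (x_hat, envelope) for a posterior pmf and distortion table.

    posterior[x] is P(x_t = x | y^t) and rho[x][x_hat] is the distortion.
    Ties are broken in favour of the smallest reproduction index.
    """
    n_hat = len(rho[0])
    costs = [sum(p * rho[x][xh] for x, p in enumerate(posterior))
             for xh in range(n_hat)]
    best = min(range(n_hat), key=costs.__getitem__)
    return best, costs[best]
```

Under Hamming distortion the response is the most likely source symbol, while under squared error on an integer alphabet it is the reproduction symbol closest to the posterior mean.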

When infinite memory is available, we can use tools from Markov decision processes (MDPs) in order
to derive a structure theorem. In Appendix C, we provide a brief background on MDPs. By Theorem I,
we know that we can confine the discussion to deterministic encoders without loss of optimality. We need
to show that our original problem can be represented as an MDP. The proof of Theorem II will follow
immediately from Theorem A.1, given in Appendix C. In order to show that we have an MDP, as we
discuss in Appendix C, we need to show that:

• We can find a sequence of deterministic functions $\{\gamma_t\}$, along with two finite spaces, $S$, $A$, such that
the average cost, defined by (6), can be written as $J = \sum_{t=1}^{T} \gamma_t(s_t, a_t)$, where $s_t \in S$, $a_t \in A$
are the system state and the action taken by the decision maker at stage $t$, respectively.

• The next state is chosen according to $P(s_{t+1}|s^t, a^t) = P(s_{t+1}|s_t, a_t)$, i.e., the state is Markov
conditioned on $a_t$.

We define our state as $s_t = P_{X_t|Y^{t-1}}(\cdot|y^{t-1})$ and our actions as mappings $a_t : \mathcal{X} \to \mathcal{Y}$. We note that for every history,
the general deterministic encoder (which is a function of $x^t$) is a mapping from $x_t$ to $y_t$. Our action,
$a_t$, is this mapping. Since there is only a finite number of mappings from $\mathcal{X}$ to $\mathcal{Y}$, our action space is
finite. Our state space is also finite. This is true since we consider only deterministic encoders, of which
there is only a finite number. Therefore, at each stage, there is only a finite number of possible $P_{X_t|Y^{t-1}}$.
This means that the cardinality of the state alphabet grows with the time horizon $T$. Note however, that
the decoder's state alphabet, $\mathcal{Z}_t = \mathcal{Y}^t$, grows as well in this case. We start by showing that the cost
function can be written as a function of the current state and action. Treating the codeword length first:

Ly^iy 1 - 1 ) = minJTPto'- 1 )/^) 

yt 



min y^P{y t ,x t \y t ^(yt) 

yt,x t 

min V Pixtly^Piytlxt, y 1 ' 1 )^, 

Ac A < • 



yt,%t 

= FJ in , Pixtly'- 1 )! {a t (x t ) = y t } l{y t ) 

yt,xt 

= a t (s t ,a t ), (27) 

where the equation preceding the last one is true since we know the function from $x_t$ to $y_t$. We now move
on to the average distortion. We first show that the optimal reproduction function, $X_{\mathrm{Bayes}}(P_{X_t|Y^t})$, is a
function of $(a_t, s_t, y_t)$. To see this, note that

$$P(x_t|y^t) = \frac{P(x_t|y^{t-1})\,\mathbf{1}\{a_t(x_t)=y_t\}}{\sum_{x'_t} P(x'_t|y^{t-1})\,\mathbf{1}\{a_t(x'_t)=y_t\}} \triangleq f(s_t,a_t,x_t,y_t). \tag{28}$$

Therefore, the optimal reproduction function, which is a function of $P_{X_t|Y^t}(\cdot|y^t)$, is a function of $(s_t, a_t, y_t)$,
i.e., $X_{\mathrm{Bayes}}(P_{X_t|Y^t}) = g^*_t(s_t, a_t, y_t)$. Using this notation we have

$$\begin{aligned}
E\left[\rho(X_t, g^*_t(s_t,a_t,Y_t)) \,\middle|\, Y^{t-1}=y^{t-1}\right]
&= \sum_{x_t,y_t} P(x_t,y_t|y^{t-1})\,\rho(x_t, g^*_t(s_t,a_t,y_t)) \\
&= \sum_{x_t,y_t} P(x_t|y^{t-1})\,P(y_t|x_t,y^{t-1})\,\rho(x_t, g^*_t(s_t,a_t,y_t)) \\
&= \sum_{x_t,y_t} P(x_t|y^{t-1})\,\mathbf{1}\{a_t(x_t)=y_t\}\,\rho(x_t, g^*_t(s_t,a_t,y_t)) \\
&\triangleq \beta_t(s_t,a_t).
\end{aligned} \tag{29}$$



Denoting $\beta_t(s_t, a_t) + \lambda\alpha_t(s_t, a_t) = \gamma_t(s_t, a_t)$, our optimality criterion can be written as $\frac{1}{T}\sum_{t=1}^{T} E\gamma_t(s_t, a_t)$.
We move on to show that the state sequence is Markov conditioned on the action, namely, $P(s_{t+1}|s^t, a^t) =
P(s_{t+1}|s_t, a_t)$. We start by noting that $s_{t+1} = P_{X_{t+1}|Y^t}(\cdot|y^t)$ is a function of $(a_t, s_t, y_t)$. For every $x_{t+1}$,
we have

$$\begin{aligned}
P(x_{t+1}|y^t)
&= \frac{\sum_{x_t} P(x_{t+1},x_t,y_t|y^{t-1})}{\sum_{x_t,x'_{t+1}} P(x'_{t+1},x_t,y_t|y^{t-1})} \\
&= \frac{\sum_{x_t} P(x_t|y^{t-1})\,P(x_{t+1}|x_t,y^{t-1})\,P(y_t|x_t,x_{t+1},y^{t-1})}{\sum_{x_t,x'_{t+1}} P(x'_{t+1},x_t,y_t|y^{t-1})} \\
&= \frac{\sum_{x_t} P(x_t|y^{t-1})\,P(x_{t+1}|x_t)\,\mathbf{1}\{a_t(x_t)=y_t\}}{\sum_{x_t,x'_{t+1}} P(x_t|y^{t-1})\,P(x'_{t+1}|x_t)\,\mathbf{1}\{a_t(x_t)=y_t\}} \\
&\triangleq f(a_t,s_t,x_{t+1},y_t).
\end{aligned} \tag{30}$$



Therefore, $s_{t+1} = h(a_t, s_t, y_t)$, for a function $h$ that uses (30) for every $x_{t+1}$. Now

$$\begin{aligned}
P(s_{t+1} = v|s^t, a^t) &= \sum_{y_t,x_t} \mathbf{1}\{h(a_t,s_t,y_t)=v\}\,\mathbf{1}\{a_t(x_t)=y_t\}\,P(x_t|y^{t-1}) \\
&= P(s_{t+1} = v|s_t,a_t),
\end{aligned} \tag{31}$$

since the current prior on $x_t$ is given. We showed that our system can be represented as an MDP. By
invoking Theorem A.1 we know that the optimal action at each stage, $a_t$, is a deterministic function of the
state. Namely, the mapping from $x_t$ to $y_t$ can be chosen deterministically as a function of $P_{X_t|Y^{t-1}}(\cdot|y^{t-1})$.
Therefore, $Y_t$ is a deterministic function of $(X_t, P_{X_t|Y^{t-1}}(\cdot|y^{t-1}))$, which concludes the proof of Theorem
II. Since the state can be recursively calculated (see eq. (30)), the encoder does not need to store $y^{t-1}$
but rather a probability measure (a vector in $\mathbb{R}^{|\mathcal{X}|}$).
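The recursive state update can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary-based representation, the function name `update_belief`, and the three-symbol example source are hypothetical choices, not the paper's notation. The update keeps the source symbols consistent with the emitted output, pushes them through the Markov transition, and normalizes.

```python
def update_belief(belief, action, y, transition):
    """One step of the belief recursion (illustrative sketch).

    belief:     dict x_t -> P(x_t | y^{t-1})            (the state s_t)
    action:     dict x_t -> y_t                          (the encoder mapping a_t)
    y:          the output symbol y_t actually emitted
    transition: dict x_t -> dict x_{t+1} -> P(x_{t+1} | x_t)
    Returns a dict x_{t+1} -> P(x_{t+1} | y^t)           (the state s_{t+1})
    """
    unnormalized = {}
    for x, p in belief.items():
        if action[x] != y:  # indicator 1{a_t(x_t) = y_t}
            continue
        for x_next, q in transition[x].items():
            unnormalized[x_next] = unnormalized.get(x_next, 0.0) + p * q
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("y has zero probability under this belief and action")
    return {x_next: v / total for x_next, v in unnormalized.items()}


# Hypothetical three-symbol Markov source and a two-letter encoder output alphabet.
transition = {
    0: {0: 0.8, 1: 0.1, 2: 0.1},
    1: {0: 0.1, 1: 0.8, 2: 0.1},
    2: {0: 0.1, 1: 0.1, 2: 0.8},
}
belief = {0: 0.5, 1: 0.3, 2: 0.2}
action = {0: "a", 1: "a", 2: "b"}
new_belief = update_belief(belief, action, "a", transition)
```

The encoder only needs `belief` and the chosen `action` to compute the next state, which is why storing the full output history is unnecessary.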

IV. Markov Memory Update Functions 

A. Preliminaries and main result 

In Section III, we showed that for given memory update, distortion and reproduction functions, there
is no loss of optimality if the encoders use the current source symbol and the state of the decoder, which
they track. We will refer to this class of encoders as tracking encoders. In the overall optimization of
the system, there is still the task of finding the best memory update and reproduction functions at each
stage. When the memory update functions and encoders are fixed, as we discussed in Section III-E, the
reproduction function should output the $\hat{X}_t$ that minimizes the average distortion for a given $(Y_t, Z_{t-1})$,
i.e., the Bayes response of $P_{X_t|Y_t,Z_{t-1}}$. This is simple since the reproduction function has no influence on
the future costs (cost to go) and affects only the present distortion (in a way, for the same reasons,
the two-stage lemma was simpler than the three-stage lemma). However, similarly to the encoders, the
memory update function at stage $t$ affects all future costs. In this section, we show that for a "small"
cost at each stage, one can take Markov memory update functions, defined as sliding windows over the
received symbols at the decoder, and avoid the search for the optimal $|\mathcal{Z}|$-state memory update functions.
The extra cost is a function of $|\mathcal{Z}|$ and the sliding window size only, and it vanishes as the window size
is increased.
Let

$$\Delta_{|\mathcal{Z}|} = \min_{\{r_t\},\{f_t\},\{g_t\}} E\left\{\frac{1}{T}\sum_{t=1}^{T}\left[\rho(X_t, g_t(Y_t, Z_{t-1})) + \lambda L_{Y_t|Z_{t-1}}(Z_{t-1})\right]\right\}, \tag{32}$$

where the minimization is over all next-state functions $\{r_t\}$ with a state set of size $|\mathcal{Z}|$ and all decoders and
tracking encoders that use it. Note that we choose here the whole sequence of next-state functions, encoders
and decoders for $t = 1, 2, \ldots, T$. We say that the state is Markov of length $l$ if $Z_t = \{Y_{t-1}, \ldots, Y_{t-l}\}$,
i.e., a sliding window of length $l$ on the encoder outputs. Let

$$\Delta_l = \min_{\{f_t\},\{g_t\}} E\left\{\frac{1}{T}\sum_{t=1}^{T}\left[\rho(X_t, g_t(Y_t, Y_{t-l}^{t-1})) + \lambda L_{Y_t|Y_{t-l}^{t-1}}(Y_{t-l}^{t-1})\right]\right\}, \tag{33}$$

where here the minimization is with respect to all decoders and tracking encoders that use a Markov
state of length $l$.

Theorem III. For any source statistics, when considering only tracking encoders, we have for any $l$ that
divides $T$:

$$\Delta_{|\mathcal{Z}|} \ge \Delta_l - \frac{\lambda \log|\mathcal{Z}|}{l}. \tag{34}$$

The significance of this theorem is more conceptual than operational. The system on the r.h.s. might
require more memory than the system on the l.h.s., and the search for the optimal encoders becomes more
complex as $l$ increases. However, the system on the r.h.s. is conceptually simpler and analytically more
tractable since the memory structure is simple.

Combining Theorem III with Theorem I, we have the following theorem:

Theorem IV. For a Markov source, there exists a system with deterministic encoders $Y_t = f_t(X_t, Z_{t-1})$
and Markov memory update functions with a performance loss no greater than $\lambda \log|\mathcal{Z}| / l$ per source symbol,
compared to the optimal system.
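To get a feel for how quickly the loss bound of Theorem IV shrinks, the following Python snippet tabulates $\lambda \log|\mathcal{Z}| / l$ for a few window sizes. The base-2 logarithm (lengths measured in bits), the function name `window_penalty`, and the example values of $|\mathcal{Z}|$, $l$ and $\lambda$ are assumptions chosen for illustration.

```python
import math


def window_penalty(num_states, window, lam=1.0):
    """Per-symbol loss bound lam * log2(|Z|) / l (illustrative; base 2 assumed)."""
    return lam * math.log2(num_states) / window


# Hypothetical decoder with 16 states and lambda = 1.
penalties = {l: window_penalty(16, l) for l in (1, 2, 4, 8)}
```

Doubling the window length halves the bound, so a decoder with 16 states pays at most half a bit per symbol once the window covers eight outputs.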

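Theorem III can be extended to the case where, instead of our Lagrangian cost function, we look
for the minimal average distortion subject to an average length constraint. Let

$$\begin{aligned}
\Delta_{|\mathcal{Z}|}(R) &\triangleq \min_{\{r_t\},\{f_t\},\{g_t\}} E\left\{\frac{1}{T}\sum_{t=1}^{T}\rho(X_t, g_t(Y_t, Z_{t-1}))\right\}
\quad \text{s.t.} \quad E\left\{\frac{1}{T}\sum_{t=1}^{T} L_{Y_t|Z_{t-1}}(Z_{t-1})\right\} \le R, \\
\Delta_l(R) &\triangleq \min_{\{f_t\},\{g_t\}} E\left\{\frac{1}{T}\sum_{t=1}^{T}\rho(X_t, g_t(Y_t, Y_{t-l}^{t-1}))\right\}
\quad \text{s.t.} \quad E\left\{\frac{1}{T}\sum_{t=1}^{T} L_{Y_t|Y_{t-l}^{t-1}}(Y_{t-l}^{t-1})\right\} \le R,
\end{aligned} \tag{35}$$

where the minimization is over all tracking encoders that use $X_t$ and the decoder's state, reproduction
functions and state update functions (in $\Delta_{|\mathcal{Z}|}(R)$ only). We have the following theorem: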



Theorem V. In the constrained setting, for any $l$ that divides $T$ we have

$$\Delta_{|\mathcal{Z}|}(R) \ge \Delta_l\left(R + \frac{\log|\mathcal{Z}|}{l}\right). \tag{36}$$

Note that here we do not have a theorem in the spirit of Theorem IV since we did not show that, in
this case, tracking encoders are optimal.

In the next subsection, we prove Theorem III. Theorem IV is a direct consequence of Theorem I and
Theorem III combined. Theorem V is proven in exactly the same manner as Theorem III and its proof is
therefore omitted. Theorem III is valid even without taking expectations in (32), (33) and therefore it is
also valid for individual sequences (see [13]). Theorems III-V will also hold in the setting of Section
V, where SI is available to the decoder.



B. Proof of Theorem III

The ideas in the proof rely on some ideas from [13]. Fix the optimal encoders, state update and
reproduction functions of $\Delta_{|\mathcal{Z}|}$. We start by focusing on the codeword length element of $\Delta_{|\mathcal{Z}|}$. Using the
fact that conditioning reduces the length element (see Appendix D), we have

$$R \triangleq \frac{1}{T}\sum_{t=1}^{T} E L_{Y_t|Z_{t-1}}(Z_{t-1}) \ge \frac{1}{T}\sum_{t=1}^{T} E L_{Y_t|Y_{t-l}^{t-1},Z_{t-1}}(Y_{t-l}^{t-1}, Z_{t-1}). \tag{37}$$

Since we will always deal with the expected codeword length, in order to simplify the notation, we
will use from now on $E L_{Y_t|Z_{t-1}}(Z_{t-1}) = L_{Y_t|Z_{t-1}}$ (as defined in Section II). We now add conditioning on
$Z_0, Z_l, Z_{2l}, \ldots$, which will further reduce the last expression. $Z_0$ is added to the first $l$ summands of (37),
$Z_l$ to the summands indexed by $l+1, \ldots, 2l$, and so on. This conditioning makes the conditioning on
$Z_{t-1}$ redundant since if we know the state in the past and the encoder outputs up to the present, we know
the current state as well. We continue by assuming that $l$ divides $T$:

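$$R = \frac{1}{T}\sum_{t=1}^{T} L_{Y_t|Z_{t-1}} \ge \frac{1}{T}\sum_{j=0}^{T/l-1}\;\sum_{t=jl+1}^{jl+l} L_{Y_t|Y_{t-l}^{t-1},Z_{jl}}. \tag{38}$$

Now there are two types of terms:

1) $(Y_{jl+1}, Z_{jl})$ appear together in the conditioning.

2) $Y_{jl+1}$ is conditioned on $Z_{jl}$ and the previous block: $Y_{(j-1)l+1}^{jl}$.

We rewrite the sum of (38) as two sums, pertaining to the above two types:

$$\begin{aligned}
R &\ge \frac{1}{T}\sum_{j=0}^{T/l-1}\;\sum_{t=jl+1}^{jl+l} L_{Y_t|Y_{t-l}^{t-1},Z_{jl}} \\
&= \frac{1}{T}\sum_{j=0}^{T/l-1}\;\sum_{t=jl+2}^{jl+l} L_{Y_t|Y_{t-l}^{t-1},Z_{jl}} + \frac{1}{T}\sum_{j=0}^{T/l-1} L_{Y_{jl+1}|Y_{(j-1)l+1}^{jl},Z_{jl}}.
\end{aligned} \tag{39}$$

We now use the following inequality, which is proved in Appendix D:

$$L_{Y_{jl+1}|Y_{(j-1)l+1}^{jl},Z_{jl}} \ge L_{Y_{jl+1},Z_{jl}|Y_{(j-1)l+1}^{jl}} - \log|\mathcal{Z}|. \tag{40}$$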
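Substituting (40) in (39), we have:

$$\begin{aligned}
R &\ge \frac{1}{T}\sum_{j=0}^{T/l-1}\;\sum_{t=jl+2}^{jl+l} L_{Y_t|Y_{t-l}^{t-1},Z_{jl}} + \frac{1}{T}\sum_{j=0}^{T/l-1} L_{Y_{jl+1}|Y_{(j-1)l+1}^{jl},Z_{jl}} \\
&\ge \frac{1}{T}\sum_{j=0}^{T/l-1}\;\sum_{t=jl+2}^{jl+l} L_{Y_t|Y_{t-l}^{t-1},Z_{jl}} + \frac{1}{T}\sum_{j=0}^{T/l-1} L_{Y_{jl+1},Z_{jl}|Y_{(j-1)l+1}^{jl}} - \frac{\log|\mathcal{Z}|}{l} \\
&\ge \frac{1}{T}\sum_{j=0}^{T/l-1}\;\sum_{t=jl+2}^{jl+l} L_{Y_t|Y_{t-l}^{t-1},Z_{jl}} + \frac{1}{T}\sum_{j=0}^{T/l-1} L_{Y_{jl+1},Z_{jl}|Y_{(j-1)l+1}^{jl},Z_{(j-1)l}} - \frac{\log|\mathcal{Z}|}{l}.
\end{aligned} \tag{41}$$

Regarding the distortion element of $\Delta_{|\mathcal{Z}|}$, we have:

$$\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T} E\rho(X_t, g_t(Y_t, Z_{t-1}))
&= \frac{1}{T}\sum_{j=0}^{T/l-1}\;\sum_{t=jl+2}^{jl+l} E\rho(X_t, g_t(Y_t, Z_{t-1})) + \frac{1}{T}\sum_{j=0}^{T/l-1} E\rho(X_{jl+1}, g_t(Y_{jl+1}, Z_{jl})) \\
&\ge \min_{\{g_t\}} \frac{1}{T}\left(\sum_{j=0}^{T/l-1}\;\sum_{t=jl+2}^{jl+l} E\rho(X_t, g_t(Y_{t-l}^{t}, Z_{jl})) + \sum_{j=0}^{T/l-1} E\rho(X_{jl+1}, g(Y_{(j-1)l+1}^{jl+1}, Z_{jl}, Z_{(j-1)l}))\right).
\end{aligned} \tag{42}$$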



In the last inequality, we used the fact that with the same encoders (same $\{Y_t\}$), optimal decoders that use
more data will do at least as well as the original decoders. Note that in the above derivation, $(Y_{jl+1}, Z_{jl})$
always appear together. Therefore, we set for all $j = 0, 1, \ldots, T/l - 1$, $Y'_{jl+1} = (Y_{jl+1}, Z_{jl})$, and for all
other indexes we set $Y'_t = Y_t$. Using this notation, we have for (41):

$$R \ge \frac{1}{T}\sum_{t=1}^{T} L_{Y'_t|Y'^{\,t-1}_{t-l}} - \frac{\log|\mathcal{Z}|}{l}, \tag{43}$$

and for (42):

$$\frac{1}{T}\sum_{t=1}^{T} E\rho(X_t, g_t(Y_t, Z_{t-1})) \ge \min_{\{g_t\}} \frac{1}{T}\sum_{t=1}^{T} E\rho(X_t, g_t(Y'^{\,t}_{t-l})), \tag{44}$$

and each $Y'_t$ is a function of $X_t, Y'^{\,t-1}_{t-l}$. Note that although the size of the alphabet of $Y'$ is now $|\mathcal{Y}| \times |\mathcal{Z}|$,
the size of the alphabet was not a constraint on the system and was introduced only so that it would be convenient
to define $\mathcal{A}$. The fact that it is now larger does not change any of the results obtained in the previous
sections. We have

$$\Delta_{|\mathcal{Z}|} \ge \min_{\{g_t\}} \frac{1}{T}\sum_{t=1}^{T} E\rho(X_t, g_t(Y'^{\,t}_{t-l})) + \lambda\left(\frac{1}{T}\sum_{t=1}^{T} L_{Y'_t|Y'^{\,t-1}_{t-l}} - \frac{\log|\mathcal{Z}|}{l}\right). \tag{45}$$

The r.h.s. of the above inequality was calculated with the optimal encoders of the l.h.s., with a scheme that
appends the original decoder state once every block. This is, of course, only one of the possible schemes
for Markovian states, and therefore if we optimize the r.h.s. over all encoders that use a Markovian state
of length $l$ we get

$$\Delta_{|\mathcal{Z}|} \ge \Delta_l - \frac{\lambda \log|\mathcal{Z}|}{l}. \tag{46}$$
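The block scheme used in the proof, in which the original decoder state is appended to the encoder output once every $l$ symbols, can be sketched as follows. The function name `augment_outputs`, the list-based indexing and the example symbols are hypothetical and serve only to illustrate the construction of the augmented outputs.

```python
def augment_outputs(outputs, states, window):
    """Build the augmented outputs Y'_t (illustrative sketch).

    outputs: list [Y_1, ..., Y_T] of encoder outputs
    states:  list [Z_0, ..., Z_{T-1}] of decoder states (states[t-1] is Z_{t-1})
    window:  block length l, assumed to divide T
    At the first position of each block (t = j*l + 1) the output is paired with
    the decoder state Z_{jl}; all other outputs are left unchanged.
    """
    T = len(outputs)
    if T % window != 0:
        raise ValueError("window must divide the horizon")
    augmented = []
    for t in range(1, T + 1):
        if (t - 1) % window == 0:  # t = j*l + 1
            augmented.append((outputs[t - 1], states[t - 1]))
        else:
            augmented.append(outputs[t - 1])
    return augmented


# Hypothetical example with T = 4 and l = 2.
augmented = augment_outputs(["a", "b", "c", "d"], ["z0", "z1", "z2", "z3"], 2)
```

Each appended state costs at most $\log|\mathcal{Z}|$ extra bits and occurs once per block of $l$ symbols, which is where the per-symbol penalty $\log|\mathcal{Z}|/l$ comes from.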



V. Side Information at the Decoder 

A. Preliminaries and main result 

In this section, we assume that the decoder has access to SI. The SI sequence, $W_1, W_2, \ldots, W_T$, $W_t \in \mathcal{W}$,
is generated by a discrete memoryless channel (DMC), fed by $X_1, X_2, \ldots, X_T$:

$$P(w_1, \ldots, w_T|x_1, \ldots, x_T) = \prod_{t=1}^{T} P(w_t|x_t).$$

For simplicity, we assume that $P(w|x) > 0$ for all $x \in \mathcal{X}$ and $w \in \mathcal{W}$. Our results, however, will
continue to hold without this assumption with minor changes to the length function (see [14]). The SI
is used both in the reproduction function and in the state update function. We assume that the state now
consists of two sub-states. The first, $Z^y_t \in \mathcal{Z}^y$, is independent of the SI and is updated as in Section II.
The second, $Z^w_t \in \mathcal{Z}^w$, is updated by

$$\begin{aligned}
Z^w_1 &= r^w_1(W_1, Y_1), \\
Z^w_t &= r^w_t(W_t, Y_t, Z^w_{t-1}), \quad t = 2, 3, \ldots, T.
\end{aligned} \tag{47}$$

The reproduction symbols are produced by a sequence of functions $\{g_t\}$, $g_t : \mathcal{Y} \times \mathcal{W} \times \mathcal{Z}^w \times \mathcal{Z}^y \to \hat{\mathcal{X}}$,
as follows:

$$\begin{aligned}
\hat{X}_1 &= g_1(W_1, Y_1), \\
\hat{X}_t &= g_t(W_t, Y_t, Z^w_{t-1}, Z^y_{t-1}), \quad t = 2, 3, \ldots, T.
\end{aligned} \tag{48}$$

Since $Z^w_t$ is not known at the encoder, it cannot be used by the variable-length encoder and thus the cost
function is now given by

$$J_t = E\left\{\rho_t(X_t, g_t(W_t, Y_t, Z^w_{t-1}, Z^y_{t-1})) + \lambda L_{Y_t|Z^y_{t-1}}(Z^y_{t-1})\right\}. \tag{49}$$

Let $B_t = P_{Z^w_t|X^t}(\cdot|X^t)$ and $b_t = P_{Z^w_t|X^t}(\cdot|x^t)$, i.e., $B_t$ is a probability measure over the sub-state
of the decoder, $Z^w_t$, which is not known to the encoder. Note that since the decoder does not have access
to $x^t$, $b_t$ is not known to the decoder. Our system model with SI is depicted in Figure 2.

Fig. 2: System model with SI.

The following two theorems are the contribution of this section.

Theorem VI. For a Markov source and any given sequence of memory update functions $\{r_t\}$, reproduction
functions $\{g_t\}$ and distortion measures $\{\rho_t\}$, there exists a sequence of deterministic encoders $Y_t =
f_t(B_{t-1}, X_t, Z^y_{t-1})$, which is optimal.

The last theorem basically states that the results of [3] continue to hold in this setting as well.
As in Section III, when $Z^y_t = Y^t$, we have the counterpart of Theorem II for the SI setting when the
optimal reproduction functions are used:

Theorem VII. For a Markov source and any given sequence of SI memory update functions $\{r^w_t\}$ and
distortion measures $\{\rho_t\}$, when $Z^y_t = Y^t$ and the optimal reproduction functions are used, there exists a
sequence of deterministic encoders $Y_t = f_t(P_{X_t,Z^w_{t-1}|Y^{t-1}}(\cdot,\cdot|Y^{t-1}), X_t)$ which is optimal.

Note that unlike the result of Theorem VI, in the setting of Theorem VII the encoder does not need
to store $B_{t-1}$, which is a function of $X^{t-1}$. Instead it stores the joint conditional probability measure of
$(X_t, Z^w_{t-1})$, which is a function of $Y^{t-1}$. There is no contradiction between the theorems since the setting
of Theorem VII is different both in the use of the optimal reproduction functions and in the SI-independent
sub-state of the decoder.

The proof of Theorem VI follows the lines of the proof of Theorem I after Lemmas I-IV are extended
to the setting of this section. The changes to Lemmas II and IV are quite simple (roughly speaking, instead of
$x_t$ write $(b_{t-1}, x_t)$ everywhere in the proof). The extension of the two- and three-stage lemmas (Lemmas
I and III) is more involved and is given in the next two subsections. Once these lemmas are proven, the
remainder of the proof is the same as in the previous section and is therefore omitted. Theorem VII
is proved in Subsection V-C.



B. Theorem VI - proof outline

We redefine $B_t$, $b_t$ to be $B_t = P_{Z^w_t|X^t,Y^t}(\cdot|X^t, Y^t)$ and $b_t = P_{Z^w_t|X^t,Y^t}(\cdot|x^t, y^t)$. Since Theorem VI states
that the encoders can be deterministic, the conditioning on $Y^t$ in the definition of $B_t$ is redundant since
the sequence of encoder outputs $Y^t$ is a deterministic function of the source symbols $X^t$. However, in
the proof of Theorem VI, since we are allowing stochastic encoders a priori, $Y^t$ adds information to $X^t$
and therefore this conditioning is needed. We precede the proof of this theorem with a short discussion
regarding its significance. Since $b_{t-1}$ is a deterministic function of $(x^{t-1}, y^{t-1})$, one may argue that this
theorem does not simplify the structure of the general encoder, which is, anyway, a function of $x^t, y^{t-1}$.
However, it turns out that the encoder can update $b_t$ recursively using only the data that is available to it
at each stage (i.e., $X_t, Y_t, B_{t-1}$). To see why this is true, observe that

$$\begin{aligned}
b_t(z) &= P(Z^w_t = z|x^t, y^t) = P(r^w_t(W_t, y_t, Z^w_{t-1}) = z|x^t, y^t) \\
&= \sum_{w_t, z^w_{t-1}:\; r^w_t(w_t, y_t, z^w_{t-1}) = z} P(w_t, z^w_{t-1}|x^t, y^t) \\
&= \sum_{w_t, z^w_{t-1}:\; r^w_t(w_t, y_t, z^w_{t-1}) = z} P(w_t|x^t, y^t)\, P(z^w_{t-1}|x^t, y^t, w_t) \\
&= \sum_{w_t, z^w_{t-1}:\; r^w_t(w_t, y_t, z^w_{t-1}) = z} P(w_t|x_t)\, b_{t-1}(z^w_{t-1}) \\
&\triangleq h(b_{t-1}, x_t, y_t, z).
\end{aligned} \tag{50}$$

Since this is true for any $z \in \mathcal{Z}^w$, we showed that $b_t$ is a function of $(b_{t-1}, x_t, y_t)$. Therefore, the encoder
can recursively update $b_t$ at the end of each encoding stage using its knowledge of $(b_{t-1}, x_t)$ and its last
output $y_t$.
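A minimal Python sketch of this recursive update is given below. It assumes that the SI is produced by a memoryless channel $P(w|x)$ and that the decoder sub-state is updated by a known function $r^w_t(w_t, y_t, z^w_{t-1})$; the function name `update_si_belief`, the dictionary representation and the binary example are hypothetical.

```python
def update_si_belief(b_prev, x, y, si_channel, r_w):
    """Update the encoder's belief on the decoder's SI sub-state (illustrative sketch).

    b_prev:     dict z_prev -> probability, the belief b_{t-1}
    x, y:       current source symbol x_t and encoder output y_t
    si_channel: dict x -> dict w -> P(w | x), the memoryless SI channel
    r_w:        function (w, y, z_prev) -> z, the SI sub-state update rule
    Returns a dict z -> b_t(z).
    """
    b_next = {}
    for w, p_w in si_channel[x].items():
        for z_prev, p_z in b_prev.items():
            z = r_w(w, y, z_prev)
            b_next[z] = b_next.get(z, 0.0) + p_w * p_z
    return b_next


# Hypothetical binary example: the sub-state is a running parity of (w, y).
si_channel = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
r_w = lambda w, y, z_prev: (w + y + z_prev) % 2
b_prev = {0: 0.6, 1: 0.4}
b_next = update_si_belief(b_prev, x=0, y=1, si_channel=si_channel, r_w=r_w)
```

The update touches only the previous belief, the current source symbol and the current output, so no past source symbols or outputs need to be stored.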

1) Two-stage lemma: We start by analyzing a system with only two stages, where the first encoder is
known.

Lemma V. In a two-stage system ($T = 2$), there exists a deterministic second-stage encoder, $Y_2 =
f_2(B_1, X_2, Z^y_1)$, which is optimal.

Proof of Lemma V. Note that $J_1$ is unchanged by changing the second stage encoder. Denote the set
of stochastic encoders which are functions of $(X_1, X_2, Y_1)$ by $\{f^s_{X_1X_2Y_1}\}$. The minimization of $J_2$ can be
written as

$$\begin{aligned}
J^*_2 &= \inf_{\{f^s_{X_1X_2Y_1}\}} E\left\{E\left[\rho_2(X_2, g_2(W_2, Y_2, Z^w_1, Z^y_1)) + \lambda L_{Y_2|Z^y_1}(Z^y_1) \,\middle|\, X_1, X_2, Y_1, Y_2, Z^y_1\right]\right\} \\
&= \inf_{\{f^s_{X_1X_2Y_1}\}} E\left\{\lambda L_{Y_2|Z^y_1}(Z^y_1) + E\left[\rho_2(X_2, g_2(W_2, Y_2, Z^w_1, Z^y_1)) \,\middle|\, X_1, X_2, Y_1, Y_2, Z^y_1\right]\right\}.
\end{aligned} \tag{51}$$

Focusing on the inner conditional expectation, we have

$$\begin{aligned}
E\left[\rho_2(X_2, g_2(W_2, Y_2, Z^w_1, Z^y_1)) \,\middle|\, X_1, X_2, Y_1, Y_2, Z^y_1\right]
&= \sum_{w_2, z^w_1} P(w_2|X_2)\, P(z^w_1|X_1, Y_1)\, \rho_2(X_2, g_2(w_2, Y_2, z^w_1, Z^y_1)) \\
&\triangleq \hat{\rho}_2(B_1, X_2, Y_2, Z^y_1),
\end{aligned} \tag{52}$$

where $B_1 = P_{Z^w_1|X_1,Y_1}(\cdot|X_1, Y_1)$ is a probability measure on $Z^w_1$ that represents the encoder's belief on the
decoder's unknown state. Note that $B_1$ is a deterministic function of $(X_1, Y_1)$ and the modified distortion
measure (52) depends on $(X_1, Y_1)$ only through $B_1$. Combining (51) and (52) we have

$$J^*_2 = \inf_{\{f^s_{X_1X_2Y_1}\}} E\left\{\hat{\rho}_2(B_1, X_2, Y_2, Z^y_1) + \lambda L_{Y_2|Z^y_1}(Z^y_1)\right\}, \tag{53}$$

where the expectation is with respect to $P(b_1, x_2, y_2, z^y_1)$. Consider the quadruple of RVs $(B_1, X_2, Y_2, Z^y_1)$.
We have

$$P(b_1, x_2, y_2, z^y_1) = P(b_1, x_2, z^y_1)\, P(y_2|b_1, x_2, z^y_1). \tag{54}$$

While $P(b_1, x_2, z^y_1)$, which depends on the first stage design and the source, remains fixed in the
optimization in (53) (it can be thought of as a state of the system, governed by the choice of the first stage
design), $P(y_2|b_1, x_2, z^y_1)$ depends on the second stage encoder since

$$P(y_2|b_1, x_2, z^y_1) = \frac{\sum_{x_1, y_1} P(x_1, x_2, y_1)\, P(y_2|x_1, x_2, y_1)\, P(b_1|x_1, y_1)\, P(z^y_1|y_1)}{P(b_1, x_2, z^y_1)}. \tag{55}$$

In the last expression, $P(b_1|x_1, y_1) = 1$ for all $x_1, y_1$ that yield the same specific conditional distribution,
$b_1$, over $Z^w_1$ and zero otherwise. $P(y_2|x_1, x_2, y_1)$ is governed by the second stage stochastic encoder, which
maps $(x_1, x_2, y_1)$ to a probability measure on $\mathcal{Y}$. Let us now look at the expectation in (53):

$$\begin{aligned}
E\left\{\hat{\rho}_2(B_1, X_2, Y_2, Z^y_1) + \lambda L_{Y_2|Z^y_1}(Z^y_1)\right\}
= \sum_{b_1, x_2, y_2, z^y_1} & P(b_1, x_2, z^y_1)\, P(y_2|b_1, x_2, z^y_1) \Big[\hat{\rho}_2(b_1, x_2, y_2, z^y_1) \\
& + \lambda \min_{l(\cdot)\in\mathcal{A}} \sum_{b'_1, x'_2, y'_2} P(y'_2|b'_1, x'_2, z^y_1)\, P(b'_1, x'_2|z^y_1)\, l(y'_2)\Big].
\end{aligned} \tag{56}$$

As in the proof of Lemma I, from (56) we see that the optimization will be affected by the choice of the
second stage encoder only through $P(y'_2|b'_1, x'_2, z^y_1)$ for all $(b'_1, x'_2, z^y_1)$. Denote the subset of stochastic second
stage encoders that are functions of $(b_1, x_2, z^y_1)$ by $\{f^s_{B_1X_2Z^y_1}\}$. Since $(b_1, z^y_1)$ are functions of $(x_1, y_1)$,
$\{f^s_{B_1X_2Z^y_1}\} \subset \{f^s_{X_1X_2Y_1}\}$. From (55) we see that every specific $f_2 \in \{f^s_{X_1X_2Y_1}\}$ is mapped to some specific
$\tilde{f}_2 \in \{f^s_{B_1X_2Z^y_1}\}$. Since the optimization is affected only by $P(y_2|b_1, x_2, z^y_1)$, if instead of using a specific
$f_2$ we use the $\tilde{f}_2$ that results from it, we do not change the joint probability of the quadruple
$(B_1, X_2, Y_2, Z^y_1)$ and thus the second stage cost is not changed. Therefore we conclude that

$$\inf_{\{f^s_{X_1X_2Y_1}\}} J_2 = \inf_{\{f^s_{B_1X_2Z^y_1}\}} J_2. \tag{57}$$

To complete the proof, we need to show that it is enough to search in the finite subset of deterministic
functions of $(b_1, x_2, z^y_1)$, which we denote by $\{f^d_{B_1X_2Z^y_1}\}$. This is done by repeating the arguments we used
below (15) at the end of the proof of Lemma I. ■

2) Three-stage lemma:

Lemma VI. In a three-stage system ($T = 3$) with a Markov source, if the third-stage encoder is
a deterministic function of $(B_2, X_3, Z^y_2)$, then there exists a deterministic second stage encoder $Y_2 =
f_2(B_1, X_2, Z^y_1)$ which is optimal.

Proof of Lemma VI. We define, as we did in Subsection V-B1, $\{f^s_{X_1X_2Y_1}\}$ to be the set of all possible
stochastic second stage encoders. Let $\{f^s_{B_1X_2Z^y_1}\} \subset \{f^s_{X_1X_2Y_1}\}$ be the subset that contains all stochastic
second stage encoders that are functions of $(B_1, X_2, Z^y_1)$ and finally, let $\{f^d_{B_1X_2Z^y_1}\} \subset \{f^s_{B_1X_2Z^y_1}\}$ denote
the set of deterministic encoders which are functions of $(B_1, X_2, Z^y_1)$. Since the first stage is fixed, $J_1$ is
unaffected. Our goal is to jointly optimize $(J_2 + J_3)$ with respect to the second stage encoder and show
that

$$\inf_{\{f^s_{X_1X_2Y_1}\}} (J_2 + J_3) = \min_{\{f^d_{B_1X_2Z^y_1}\}} (J_2 + J_3). \tag{58}$$

We start by focusing on the third stage cost.

$$\begin{aligned}
J_3 &= E\left\{\rho_3(X_3, g_3(Y_3, W_3, Z^w_2, Z^y_2)) + \lambda L_{Y_3|Z^y_2}(Z^y_2)\right\} \\
&= E\left\{\lambda L_{Y_3|Z^y_2}(Z^y_2) + E\left[\rho_3(X_3, g_3(Y_3, W_3, Z^w_2, Z^y_2)) \,\middle|\, X^3, Y^3, Z^y_2\right]\right\}.
\end{aligned} \tag{59}$$

Focusing on the inner expectation of (59), we have

$$\begin{aligned}
E\left[\rho_3(X_3, g_3(Y_3, W_3, Z^w_2, Z^y_2)) \,\middle|\, X^3, Y^3, Z^y_2\right]
&= E\left[\rho_3(X_3, g_3(Y_3, W_3, Z^w_2, Z^y_2)) \,\middle|\, X^3, Y^2, Z^y_2\right] \\
&= \sum_{w_3, z^w_2} P(w_3|X_3)\, P(z^w_2|X^2, Y^2)\, \rho_3(X_3, g_3(Y_3, w_3, z^w_2, Z^y_2)) \\
&\triangleq \hat{\rho}_3(B_2, X_3, Y_3, Z^y_2),
\end{aligned} \tag{60}$$

where the first equality is true since $Y_3$ is a function of $(B_2, X_3, Z^y_2)$ and $B_2$ is a deterministic function
of $(X^2, Y^2)$. Therefore,

$$J_3 = \sum_{b_2, x_3, y_3, z^y_2} P(b_2, x_3, z^y_2)\, P(y_3|b_2, x_3, z^y_2) \times \left[\hat{\rho}_3(b_2, x_3, y_3, z^y_2) + \lambda L_{Y_3|Z^y_2}(z^y_2)\right]. \tag{61}$$

In the last expression, $P(y_3|b_2, x_3, z^y_2)$ will not be affected by the optimization of the second stage encoder
since, under the assumptions of Lemma VI, the third encoder is a fixed deterministic function of $(b_2, x_3, z^y_2)$
(i.e., $P(y_3|b_2, x_3, z^y_2) = \mathbf{1}\{f_3(b_2, x_3, z^y_2) = y_3\}$). Thus, the second stage encoder affects the last expression
only through $P(b_2, x_3, z^y_2)$ since

$$P(b_2, x_3, z^y_2) = \sum_{b_1, x_2, y_2, z^y_1} P(b_1, x_2, z^y_1)\, P(y_2|b_1, x_2, z^y_1) \times P(x_3|x_2)\, \mathbf{1}\{h(b_1, x_2, y_2) = b_2\}\, \mathbf{1}\{r^y_2(z^y_1, y_2) = z^y_2\}, \tag{62}$$

where $h(b_1, x_2, y_2)$ was defined in (50) and we used the fact that the source is Markov. As in Subsection
V-B1, $P(b_1, x_2, z^y_1)$ is the result of the first stage design and the source. Note that $\mathbf{1}\{h(b_1, x_2, y_2) = b_2\}$ and
$\mathbf{1}\{r^y_2(z^y_1, y_2) = z^y_2\}$ are not affected by the choice of the second stage encoder since they represent
known deterministic functions of $(b_1, x_2, y_2)$ and $(z^y_1, y_2)$, respectively. Focusing on the third stage average
codeword length for $Z^y_2 = z^y_2$, we have

$$\begin{aligned}
L_{Y_3|Z^y_2}(z^y_2) &= \min_{l(\cdot)\in\mathcal{A}} \sum_{b_2, x_3, y_3} P(b_2, x_3, y_3|z^y_2)\, l(y_3) \\
&= \min_{l(\cdot)\in\mathcal{A}} \sum_{b_2, x_3, y_3} P(b_2, x_3|z^y_2)\, P(y_3|b_2, x_3, z^y_2)\, l(y_3).
\end{aligned} \tag{63}$$

Again, the second-stage encoder affects only $P(b_2, x_3, z^y_2)$ (and thus $P(b_2, x_3|z^y_2)$). In (55), (56) we showed
that the second-stage encoder affects the second stage cost only through $P(y_2|b_1, x_2, z^y_1)$. In (61), (62), (63)
we showed that the third stage cost depends on the second stage encoder only through $P(y_2|b_1, x_2, z^y_1)$.
Therefore, we conclude that the optimization of the second stage encoder affects $(J_2 + J_3)$ only through
$P(y_2|b_1, x_2, z^y_1)$. Repeating the arguments we used in the proof of Lemma V, if we use the $\tilde{f}_2 \in \{f^s_{B_1X_2Z^y_1}\}$
that results from a specific $f_2 \in \{f^s_{X_1X_2Y_1}\}$ (through (55)) instead of using that specific $f_2$, we would not
change the joint probability of $(B_2, X_3, Y_3, Z^y_1, Z^y_2)$ and therefore will not change the value of $(J_2 + J_3)$.
Therefore we have

$$\min_{\{f^s_{X_1X_2Y_1}\}} (J_2 + J_3) = \min_{\{f^s_{B_1X_2Z^y_1}\}} (J_2 + J_3). \tag{64}$$

From here, the same arguments we used after (23) in the proof of Lemma III will complete the proof.



C. Infinite memory decoder - proof of Theorem VII 



As in the case without SI, when $Z^y_t = Y^t$, we can use the tools of MDPs to derive a structure theorem.
We will need to redefine the state to $s_t = P_{X_t,Z^w_{t-1}|Y^{t-1}}(\cdot,\cdot|y^{t-1})$. The action is defined in the same manner
as in the case without SI, i.e., $a_t : \mathcal{X} \to \mathcal{Y}$. The optimal reproduction function, $x^*_t = g^*_t(w_t, y^t, z^w_{t-1})$, is
the Bayes response to $P_{X_t|W_t,Y^t,Z^w_{t-1}}(\cdot|w_t, y^t, z^w_{t-1})$:

$$x^*_t = \arg\min_{\hat{x}} \sum_{x_t} P(x_t|w_t, y^t, z^w_{t-1})\, \rho_t(x_t, \hat{x}). \tag{65}$$

As in Subsection III-E, in order to use the tools of MDPs, we need to show that we can write the cost
function as a function of $(s_t, a_t)$ and that the state is conditionally Markov, given $a_t$.

The optimal reproduction function is a function of $P_{X_t|W_t,Y^t,Z^w_{t-1}}(\cdot|w_t, y^t, z^w_{t-1})$. Note that

$$\begin{aligned}
P(x_t|w_t, y^t, z^w_{t-1}) &= \frac{P(x_t, w_t, y_t, z^w_{t-1}|y^{t-1})}{\sum_{x'_t} P(x'_t, w_t, y_t, z^w_{t-1}|y^{t-1})} \\
&= \frac{P(x_t, z^w_{t-1}|y^{t-1})\, P(w_t|x_t)\, \mathbf{1}\{a_t(x_t) = y_t\}}{\sum_{x'_t} P(x'_t, z^w_{t-1}|y^{t-1})\, P(w_t|x'_t)\, \mathbf{1}\{a_t(x'_t) = y_t\}} \\
&\triangleq f(s_t, a_t, w_t, x_t, y_t, z^w_{t-1}).
\end{aligned} \tag{66}$$

Therefore, the optimal reproduction function is a function of $(s_t, a_t, w_t, y_t, z^w_{t-1})$, which we denote by
$g^*_t(s_t, a_t, w_t, y_t, z^w_{t-1})$. We now move on to show that the cost function can be written as a function of the
state and action. As in Subsection III-E, we deal with the distortion and codeword length elements of the
cost separately. Treating the expected codeword length first, we have:

$$\begin{aligned}
L_{Y_t|Y^{t-1}}(y^{t-1}) &= \min_{l(\cdot)\in\mathcal{A}} \sum_{y_t} P(y_t|y^{t-1})\, l(y_t) \\
&= \min_{l(\cdot)\in\mathcal{A}} \sum_{y_t, x_t, z^w_{t-1}} P(x_t, y_t, z^w_{t-1}|y^{t-1})\, l(y_t) \\
&= \min_{l(\cdot)\in\mathcal{A}} \sum_{y_t, x_t, z^w_{t-1}} P(x_t, z^w_{t-1}|y^{t-1})\, \mathbf{1}\{a_t(x_t) = y_t\}\, l(y_t) \\
&= \alpha_t(s_t, a_t).
\end{aligned} \tag{67}$$
When using the optimal reproduction function, the average distortion is given by:

$$\begin{aligned}
E\left[\rho(X_t, g^*_t(s_t, a_t, W_t, Y_t, Z^w_{t-1})) \,\middle|\, Y^{t-1} = y^{t-1}\right]
&= \sum_{w_t, x_t, y_t, z^w_{t-1}} P(x_t, w_t, y_t, z^w_{t-1}|y^{t-1})\, \rho(x_t, g^*_t(s_t, a_t, w_t, y_t, z^w_{t-1})) \\
&= \sum_{w_t, x_t, y_t, z^w_{t-1}} P(x_t, z^w_{t-1}|y^{t-1})\, P(w_t|x_t)\, \mathbf{1}\{a_t(x_t) = y_t\}\, \rho(x_t, g^*_t(s_t, a_t, w_t, y_t, z^w_{t-1})) \\
&\triangleq \beta_t(s_t, a_t).
\end{aligned} \tag{68}$$

Denoting $\beta_t(s_t, a_t) + \lambda\alpha_t(s_t, a_t) = \gamma_t(s_t, a_t)$, our optimality criterion can be written as $\frac{1}{T}\sum_{t=1}^{T} E\gamma_t(s_t, a_t)$.

We move on to show that the state process is Markov conditioned on the action, namely, $P(s_{t+1}|s^t, a^t) =
P(s_{t+1}|s_t, a_t)$. We start by noting that $s_{t+1} = P_{X_{t+1},Z^w_t|Y^t}(\cdot,\cdot|y^t)$ is a function of $(s_t, a_t, y_t)$. For every
$(x_{t+1}, z^w_t)$ we have

$$\begin{aligned}
P(x_{t+1}, z^w_t|y^t)
&= \frac{\sum_{w_t, x_t, z^w_{t-1}} P(x_t, x_{t+1}, w_t, y_t, z^w_{t-1}, z^w_t|y^{t-1})}{\sum_{w_t, x_t, x'_{t+1}, z^w_{t-1}, z'^{w}_t} P(x_t, x'_{t+1}, w_t, y_t, z^w_{t-1}, z'^{w}_t|y^{t-1})} \\
&= \frac{\sum_{w_t, x_t, z^w_{t-1}} P(x_t, z^w_{t-1}|y^{t-1})\, P(x_{t+1}|x_t)\, P(w_t|x_t)\, \mathbf{1}\{a_t(x_t) = y_t\}\, \mathbf{1}\{z^w_t = r^w_t(w_t, y_t, z^w_{t-1})\}}{\sum_{w_t, x_t, x'_{t+1}, z^w_{t-1}, z'^{w}_t} P(x_t, z^w_{t-1}|y^{t-1})\, P(x'_{t+1}|x_t)\, P(w_t|x_t)\, \mathbf{1}\{a_t(x_t) = y_t\}\, \mathbf{1}\{z'^{w}_t = r^w_t(w_t, y_t, z^w_{t-1})\}} \\
&= f(s_t, a_t, x_{t+1}, y_t, z^w_t),
\end{aligned} \tag{69}$$

and therefore, $s_{t+1} = h(s_t, a_t, y_t)$ for a function, $h$, that uses (69) for every pair $(x_{t+1}, z^w_t)$. Now,

$$\begin{aligned}
P(s_{t+1} = v|s^t, a^t) &= \sum_{y_t, x_t, z^w_{t-1}} \mathbf{1}\{h(s_t, a_t, y_t) = v\}\, \mathbf{1}\{a_t(x_t) = y_t\}\, P(x_t, z^w_{t-1}|y^{t-1}) \\
&= P(s_{t+1} = v|s_t, a_t).
\end{aligned} \tag{70}$$

We showed that our system can be represented as an MDP. By invoking Theorem A.1, we know that the
optimal action at each stage, $a_t$, is a deterministic function of the state. Namely, the mapping from $x_t$ to
$y_t$ can be chosen deterministically as a function of $P_{X_t,Z^w_{t-1}|Y^{t-1}}(\cdot,\cdot|y^{t-1})$. Therefore, $Y_t$ is a deterministic
function of $(X_t, P_{X_t,Z^w_{t-1}|Y^{t-1}}(\cdot,\cdot|y^{t-1}))$, which concludes the proof of Theorem VII. By (69), the encoder
does not need to store $y^{t-1}$, but rather a probability measure (a vector in $\mathbb{R}^{|\mathcal{X}||\mathcal{Z}^w|}$).

VI. Conclusion 

This work extended the setting of [1] to include both variable rate coding and SI. It was shown that
structure theorems, in the spirit of [1] and [3], continue to hold in this setting as well. These theorems
are further refined when the decoder has infinite memory. We were able to show that the cost function
is concave in the choices of past encoders (Lemmas II, IV) and therefore the optimal encoders are
deterministic. It was also shown that, in order to simplify the overall system optimization, one can use
sliding-window next-state functions, and the excess loss incurred by this suboptimal choice vanishes as
the window size increases (Theorems III, IV). However, in the finite horizon setting we investigated, the
window size is always upper bounded by the time horizon.

Extensions to this work would include investigating the infinite horizon setting. While Theorem III
carries over verbatim to the infinite horizon setting, this is not necessarily true for the other theorems, which
were proved using dynamic programming. Another extension would be to investigate the constrained
setting (briefly mentioned in Theorem V). In this case, relying on results from constrained MDPs, we do
not expect the optimal encoders to be deterministic (see [15]).

Appendix 

A. Proof of Lemma II

We start by focusing on the average codeword length element of the cost function and show that
$L_{Y_t|Z_{t-1}}(z_{t-1})$ is concave in $\{f^s_{X_tZ_{t-1}}\}$. For $0 \le \alpha \le 1$ and $f_1, f_2 \in \{f^s_{X_tZ_{t-1}}\}$, let

$$f_\alpha = \alpha f_1 + (1 - \alpha) f_2.$$

This means that for any $(x_t, z_{t-1})$ we have

$$P_\alpha(y_t|x_t, z_{t-1}) = \alpha P_1(y_t|x_t, z_{t-1}) + (1 - \alpha) P_2(y_t|x_t, z_{t-1}). \tag{A.1}$$

Let $L^{\alpha}_{Y_t}(z_{t-1})$, $L^{1}_{Y_t}(z_{t-1})$, $L^{2}_{Y_t}(z_{t-1})$ denote the length function calculated with $f_\alpha, f_1, f_2$, respectively.
We have

$$\begin{aligned}
L^{\alpha}_{Y_t}(z_{t-1}) &= \min_{l(\cdot)\in\mathcal{A}} \sum_{y_t} P_\alpha(y_t|z_{t-1})\, l(y_t) \\
&= \min_{l(\cdot)\in\mathcal{A}} \sum_{y_t, x_t} \left[\alpha P_1(y_t|x_t, z_{t-1}) + (1 - \alpha) P_2(y_t|x_t, z_{t-1})\right] P(x_t|z_{t-1})\, l(y_t) \\
&\ge \alpha \min_{l(\cdot)\in\mathcal{A}} \left\{\sum_{y_t, x_t} P_1(y_t|x_t, z_{t-1})\, P(x_t|z_{t-1})\, l(y_t)\right\} + (1 - \alpha) \min_{l(\cdot)\in\mathcal{A}} \left\{\sum_{y_t, x_t} P_2(y_t|x_t, z_{t-1})\, P(x_t|z_{t-1})\, l(y_t)\right\} \\
&= \alpha L^{1}_{Y_t}(z_{t-1}) + (1 - \alpha) L^{2}_{Y_t}(z_{t-1}),
\end{aligned} \tag{A.2}$$

where we used the fact that the sum of minima is smaller than the minimum of a sum. Since the distortion
part of the cost is linear in $P(y_t|x_t, z_{t-1})$ (through the expectation), we have that the overall stage $t$ cost
function is concave in $\{f^s_{X_tZ_{t-1}}\}$.
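The concavity of the length function can be checked numerically. The sketch below assumes that the length function is the minimum expected length over binary prefix-code length assignments, which a Huffman code attains; the function names and the two example output distributions are hypothetical choices made for illustration.

```python
import heapq


def optimal_expected_length(probs):
    """Expected length of a binary Huffman code for the given probabilities.

    The expected length equals the sum of the weights of all merged nodes.
    """
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total


def mix(p1, p2, alpha):
    """Output distribution induced by the mixed encoder alpha*f1 + (1-alpha)*f2."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(p1, p2)]


# Hypothetical output distributions induced by two encoders f1 and f2.
p1 = [0.7, 0.1, 0.1, 0.1]
p2 = [0.1, 0.1, 0.1, 0.7]
alpha = 0.5

length_of_mixture = optimal_expected_length(mix(p1, p2, alpha))
mixture_of_lengths = (alpha * optimal_expected_length(p1)
                      + (1 - alpha) * optimal_expected_length(p2))
```

In this example the mixed encoder yields a longer optimal expected length than the average of the two individual lengths, in line with the concavity shown in (A.2).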



B. Proof of Lemma IV 



Fix any third stage encoder which is a deterministic function of $(X_3, Z_2)$. We showed in (19) that the
second stage encoder affects $J_3$ only through $P(x_3, z_2)$ (and thus also through $P(z_2)$ and $P(x_3|z_2)$). Let
$f_1, f_2 \in \{f^s_{X_2Z_1}\}$ be two second stage stochastic encoders which are functions of $(X_2, Z_1)$. Let

$$\begin{aligned}
P_\gamma(x_3, z_2) &= \sum_{x_2, y_2, z_1} P(x_2, z_1) \left[\gamma P_1(y_2|x_2, z_1) + (1 - \gamma) P_2(y_2|x_2, z_1)\right] P(z_2|z_1, y_2)\, P(x_3|x_2) \\
&= \gamma P_1(x_3, z_2) + (1 - \gamma) P_2(x_3, z_2),
\end{aligned} \tag{A.3}$$

where $P_1(x_3, z_2)$, $P_2(x_3, z_2)$ are calculated with $P_1(y_2|x_2, z_1)$, $P_2(y_2|x_2, z_1)$ that result from $f_1, f_2$,
respectively. Similarly, for $i = 1, 2, \gamma$, define $P_i(z_2)$ and $P_i(x_3|z_2)$ as the marginal and conditional distributions,
respectively, resulting from the probability measures in (A.3). We now show that $P_\gamma(x_3|z_2)$ can be written
as a convex combination of $P_1(x_3|z_2)$, $P_2(x_3|z_2)$.

$$\begin{aligned}
P_\gamma(x_3|z_2) &= \frac{P_\gamma(x_3, z_2)}{P_\gamma(z_2)} \\
&= \frac{\gamma P_1(x_3, z_2) + (1 - \gamma) P_2(x_3, z_2)}{\sum_{x'_3} \left[\gamma P_1(x'_3, z_2) + (1 - \gamma) P_2(x'_3, z_2)\right]} \\
&= \frac{\gamma P_1(x_3, z_2)}{\sum_{x'_3} \left[\gamma P_1(x'_3, z_2) + (1 - \gamma) P_2(x'_3, z_2)\right]} + \frac{(1 - \gamma) P_2(x_3, z_2)}{\sum_{x'_3} \left[\gamma P_1(x'_3, z_2) + (1 - \gamma) P_2(x'_3, z_2)\right]} \\
&= \alpha \frac{P_1(x_3, z_2)}{\sum_{x'_3} P_1(x'_3, z_2)} + \beta \frac{P_2(x_3, z_2)}{\sum_{x'_3} P_2(x'_3, z_2)},
\end{aligned} \tag{A.4}$$

with

$$\begin{aligned}
\alpha &= \frac{\gamma \sum_{x'_3} P_1(x'_3, z_2)}{\sum_{x'_3} \left[\gamma P_1(x'_3, z_2) + (1 - \gamma) P_2(x'_3, z_2)\right]} = \frac{\gamma P_1(z_2)}{P_\gamma(z_2)}, \\
\beta &= \frac{(1 - \gamma) \sum_{x'_3} P_2(x'_3, z_2)}{\sum_{x'_3} \left[\gamma P_1(x'_3, z_2) + (1 - \gamma) P_2(x'_3, z_2)\right]} = \frac{(1 - \gamma) P_2(z_2)}{P_\gamma(z_2)}.
\end{aligned} \tag{A.5}$$

Note that $0 \le \alpha, \beta \le 1$ and $\alpha + \beta = 1$. We showed that

$$P_\gamma(x_3|z_2) = \alpha P_1(x_3|z_2) + (1 - \alpha) P_2(x_3|z_2). \tag{A.6}$$



We are now ready to prove the lemma. For any given third stage encoder, let J 3 (P{), i = 1, 2, 7, denote 
the third stage cost as a function of the joint probability of (X 3 , Z 2 ), where the dependence on the second 
stage encoder was shown in ([20l),( |A.3| ). In order to prove the lemma, we need to show that: 

HP,) > lUPi) + (1 - 7)^2). (A.7) 

We now focus on the codeword length element of the cost function. Let L 3 (Pj), i = 1,2,7, denote the 
third stage average codeword length as a function of the joint probability of (X 3 , Z 2 ) 



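\begin{align}
L_3(P_\gamma) &= \sum_{z_2} P_\gamma(z_2) \min_{l(\cdot) \in A} \sum_{y_3, x_3} P_\gamma(y_3, x_3|z_2) l(y_3) \nonumber\\
&= \sum_{z_2} P_\gamma(z_2) \min_{l(\cdot) \in A} \sum_{y_3, x_3} P_\gamma(x_3|z_2) P(y_3|x_3, z_2) l(y_3) \nonumber\\
&= \sum_{z_2} P_\gamma(z_2) \min_{l(\cdot) \in A} \sum_{y_3, x_3} \left[\alpha P_1(x_3|z_2) + (1-\alpha) P_2(x_3|z_2)\right] P(y_3|x_3, z_2) l(y_3) \tag{A.8}\\
&\ge \sum_{z_2} P_\gamma(z_2)\, \alpha \min_{l(\cdot) \in A} \sum_{y_3, x_3} P_1(x_3|z_2) P(y_3|x_3, z_2) l(y_3) \nonumber\\
&\quad + \sum_{z_2} P_\gamma(z_2) (1-\alpha) \min_{l(\cdot) \in A} \sum_{y_3, x_3} P_2(x_3|z_2) P(y_3|x_3, z_2) l(y_3) \tag{A.9}\\
&= \sum_{z_2} P_\gamma(z_2) \frac{\gamma P_1(z_2)}{P_\gamma(z_2)} \min_{l(\cdot) \in A} \sum_{y_3, x_3} P_1(x_3|z_2) P(y_3|x_3, z_2) l(y_3) \nonumber\\
&\quad + \sum_{z_2} P_\gamma(z_2) \frac{(1-\gamma) P_2(z_2)}{P_\gamma(z_2)} \min_{l(\cdot) \in A} \sum_{y_3, x_3} P_2(x_3|z_2) P(y_3|x_3, z_2) l(y_3) \tag{A.10}\\
&= \gamma L_3(P_1) + (1-\gamma) L_3(P_2), \tag{A.11}
\end{align}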



where in (A.8) we used (A.6), (A.9) is true since the minimum of a sum is greater than or equal to the sum of minima,
and finally, in (A.10) we substituted $\alpha$ given in (A.5). Thus, we showed that the codeword length element
of the cost function is concave in the choice of the second stage encoder. We have

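\begin{align}
J_3(P_\gamma) &= \sum_{x_3, y_3, z_2} P(y_3|x_3, z_2) P_\gamma(x_3, z_2) \rho(x_3, g(y_3, z_2)) + \lambda L_3(P_\gamma) \nonumber\\
&= \sum_{x_3, y_3, z_2} \left[\gamma P(y_3|x_3, z_2) P_1(x_3, z_2) \rho(x_3, g(y_3, z_2)) + (1-\gamma) P(y_3|x_3, z_2) P_2(x_3, z_2) \rho(x_3, g(y_3, z_2))\right] + \lambda L_3(P_\gamma) \nonumber\\
&\ge \gamma J_3(P_1) + (1-\gamma) J_3(P_2), \tag{A.12}
\end{align}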



where in the last step we used (A.11), and the lemma is proven.
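The concavity of the codeword length term can also be checked numerically. The sketch below is only an illustration under stated assumptions: the family of length functions `LENGTH_FAMILY`, the alphabet sizes, and the randomly drawn deterministic third stage encoder are all hypothetical choices, not taken from the paper. It uses the fact that $P(z_2) P(x_3|z_2) = P(x_3, z_2)$, so the average length is a sum over $z_2$ of minima of functions that are linear in the joint distribution.

```python
import random

# Hypothetical finite family of length functions over a 3-letter codeword
# alphabet; each tuple satisfies Kraft's inequality with equality.
LENGTH_FAMILY = [(1, 2, 2), (2, 1, 2), (2, 2, 1)]


def random_joint(n_x, n_z, rng):
    """Random strictly positive joint pmf P(x, z), keyed by (x, z)."""
    weights = [rng.random() + 0.01 for _ in range(n_x * n_z)]
    total = sum(weights)
    return {(x, z): weights[x * n_z + z] / total
            for x in range(n_x) for z in range(n_z)}


def avg_length(joint, encoder, n_x, n_z):
    """sum_z min_l sum_x P(x, z) * l(encoder(x, z)) for a deterministic encoder."""
    total = 0.0
    for z in range(n_z):
        total += min(
            sum(joint[(x, z)] * lengths[encoder[(x, z)]] for x in range(n_x))
            for lengths in LENGTH_FAMILY
        )
    return total


def concavity_gap(gamma, seed, n_x=4, n_z=3):
    """L(P_gamma) - [gamma * L(P_1) + (1 - gamma) * L(P_2)]; concavity means >= 0."""
    rng = random.Random(seed)
    p1 = random_joint(n_x, n_z, rng)
    p2 = random_joint(n_x, n_z, rng)
    # Arbitrary deterministic encoder mapping (x, z) to one of 3 codewords.
    encoder = {(x, z): rng.randrange(3) for x in range(n_x) for z in range(n_z)}
    p_mix = {k: gamma * p1[k] + (1 - gamma) * p2[k] for k in p1}
    mixed = avg_length(p_mix, encoder, n_x, n_z)
    separate = (gamma * avg_length(p1, encoder, n_x, n_z)
                + (1 - gamma) * avg_length(p2, encoder, n_x, n_z))
    return mixed - separate
```

A nonnegative gap for every tested mixture is consistent with the inequality between the first and last lines of the codeword length derivation; it is a spot check, not a proof.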



C. Markov decision processes - short overview 

In a Markov decision process, a decision maker is influencing the behavior of a Markov probabilistic 
system through his actions, as the system evolves in time. Formally, a discrete time, finite horizon Markov 
decision process is defined by $\{T, \mathcal{S}, \mathcal{A}, \{P_t(\cdot|s, a)\}, \{\rho_t(s, a)\}\}$, where,

• $T$ is the time horizon, $t = 1, 2, \ldots, T$.

• $\mathcal{S}$ is the state space.

• $\mathcal{A}$ is the action space.

• $P_t(\cdot|s, a)$ is the transition probability to the system's next state, given the previous system state
and action. The transition probabilities obey $P_t(\cdot|h_t, a_t) = P_t(\cdot|s_t, a_t)$, namely, the distribution of the
next state, $s_{t+1}$, depends on the history only through $(s_t, a_t)$.

• $P_0(\cdot)$ is the probability measure over the initial state.

• $\rho_t(s, a)$ is the cost incurred when at stage $t$ and state $s$, action $a$ is taken.

In our case, the goal of the decision maker is to minimize the expected average cost $E\, \frac{1}{T} \sum_{t=1}^{T} \rho_t(S_t, A_t)$.
The history of the process at stage $t$ is $h_t = (s_1, a_1, s_2, a_2, \ldots, s_{t-1}, a_{t-1}, s_t)$, i.e., all previous actions
taken by the decision maker and the system states, up to stage $t$. Note that $h_t = \{h_{t-1}, a_{t-1}, s_t\}$.

A decision rule, d t , prescribes the procedure for action selection in a given state at stage t. Decision 
rules can range from deterministic functions of the current state to randomized functions that depend on 
the whole history of states and actions, up to stage t. A decision rule that is a deterministic function 
of the current state will be called a Markovian deterministic (MD) decision rule. A policy specifies the 
decision rules to be used at all stages, i.e., a policy $\pi$ is a sequence of decision rules $d_1, \ldots, d_T$. We say
that a policy is MD if all its decision rules are MD. 

In Sections III-E and IV-C, the state space is finite; however, it grows as the system evolves, i.e., at each
stage the state space is $\mathcal{S}_t$. We set $\mathcal{S} = \bigcup_{t=1}^{T} \mathcal{S}_t$. The action space is the set of deterministic functions
$f : \mathcal{X} \to \mathcal{Y}$, which is finite.



We will use the following theorem, which is the key to the results of Section III-E.

Theorem A.1 ([16, Proposition 4.4.3]): There exists an MD policy which is optimal.

We outline the proof here for completeness.

Proof of Theorem A.1 (outline): Define for policy $\pi$, $u_t^{\pi}(h_t) = E^{\pi}\left[\sum_{n=t}^{T} \rho_n(s_n, a_n) \,\middle|\, h_t\right]$, where the actions
$a_t$ are prescribed by the policy $\pi$. Note that



\begin{align}
u_t^{\pi}(h_t) = \rho_t(s_t, a_t) + \sum_{j \in \mathcal{S}} P_t(j|s_t, a_t)\, u_{t+1}^{\pi}(h_t, a_t, j). \tag{A.13}
\end{align}



Let $u_t^*(h_t) = \inf_{\pi} u_t^{\pi}(h_t)$. We start by showing that $u_t^*(h_t)$ depends on the history only through $s_t$. We will
use backward induction. Note that $u_T^*(h_T) = \min_{a \in \mathcal{A}} \rho_T(s_T, a)$, so the claim is valid for the last stage.
Now assume that the claim is valid for $n = t+1, t+2, \ldots, T$. We have



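\begin{align}
u_t^*(h_t) &= \min_{a \in \mathcal{A}} \left\{ \rho_t(s_t, a) + \sum_{j \in \mathcal{S}} P_t(j|s_t, a)\, u_{t+1}^*(h_t, a, j) \right\} \nonumber\\
&= \min_{a \in \mathcal{A}} \left\{ \rho_t(s_t, a) + \sum_{j \in \mathcal{S}} P_t(j|s_t, a)\, u_{t+1}^*(j) \right\}, \tag{A.14}
\end{align}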

where the last equation is due to the induction hypothesis. Since the term in brackets depends on the 
history only through $s_t$, the induction step is proven. Now, define the decision rule at each stage for every
$s_t \in \mathcal{S}$ as the minimizer of (A.14). By construction, this decision rule is MD and the policy constructed
from these decision rules is optimal. 
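The backward induction in the proof outline can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the function `backward_induction`, its callback signatures, and the two-state toy problem are hypothetical and are not the coding problem of the paper. The point is that the minimizing action at each stage is stored per current state only, so the resulting policy is Markovian deterministic.

```python
def backward_induction(horizon, states, actions, cost, transition):
    """Finite-horizon backward induction.

    cost(t, s, a) returns the stage cost; transition(t, s, a) returns a dict
    mapping next states to probabilities. Returns (value, policy), where
    value[t][s] is the optimal cost-to-go and policy[t][s] the minimizing action.
    """
    # Zero terminal value, so that value[horizon][s] = min_a cost(horizon, s, a).
    value = {horizon + 1: {s: 0.0 for s in states}}
    policy = {}
    for t in range(horizon, 0, -1):
        value[t], policy[t] = {}, {}
        for s in states:
            best_action, best_cost = None, float("inf")
            for a in actions:
                expected_next = sum(
                    p * value[t + 1][j] for j, p in transition(t, s, a).items()
                )
                total = cost(t, s, a) + expected_next
                if total < best_cost:
                    best_action, best_cost = a, total
            value[t][s] = best_cost
            policy[t][s] = best_action
    return value, policy


# Hypothetical toy problem: state 1 costs 1 per stage, switching costs 0.5.
def toy_cost(t, s, a):
    return s + (0.5 if a == "switch" else 0.0)


def toy_transition(t, s, a):
    return {1 - s: 1.0} if a == "switch" else {s: 1.0}


toy_value, toy_policy = backward_induction(
    3, [0, 1], ["stay", "switch"], toy_cost, toy_transition
)
```

In the toy problem, switching out of the costly state pays off in the first two stages but not in the last one, where no future cost remains to be saved.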

D. Properties of the length function 

Let $\mathcal{W}$, $\mathcal{Y}$, $\mathcal{Z}$ be finite alphabets.

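1) Conditioning reduces the length: We have

\begin{align}
L_{Y|W} &= \sum_{w} P(w) \min_{l(\cdot) \in A} \sum_{y} P(y|w) l(y) \nonumber\\
&= \sum_{w} P(w) \min_{l(\cdot) \in A} \sum_{z} P(z|w) \sum_{y} P(y|w, z) l(y) \nonumber\\
&\ge \sum_{w, z} P(w, z) \min_{l(\cdot) \in A} \sum_{y} P(y|w, z) l(y) = L_{Y|W,Z}, \tag{A.15}
\end{align}

where the inequality is true since the minimum of a sum is greater than or equal to the sum of minima.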
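A small numerical illustration of this property follows, reading "conditioning reduces the length" as $L_{Y|W,Z} \le L_{Y|W}$. The sketch assumes that the admissible length functions are those of binary prefix codes, so that each inner minimum is attained by a Huffman code; this assumption, the function names and the random test distribution are illustrative and not taken from the paper. It uses the standard fact that the weighted Huffman cost equals the sum of the weights of all merged nodes.

```python
import heapq
import random


def huffman_cost(weights):
    """min over binary prefix-code lengths l of sum_y weights[y] * l(y).

    Weights need not be normalized; the cost equals the sum of the weights of
    all internal nodes created while merging. Assumes at least two symbols.
    """
    heap = list(weights)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total


def random_joint(n_w, n_y, n_z, rng):
    """Random strictly positive joint pmf P(w, y, z), keyed by (w, y, z)."""
    keys = [(w, y, z) for w in range(n_w) for y in range(n_y) for z in range(n_z)]
    weights = [rng.random() + 0.01 for _ in keys]
    total = sum(weights)
    return {k: v / total for k, v in zip(keys, weights)}


def length_given_w(joint, n_w, n_y, n_z):
    """L_{Y|W}: sum_w min_l sum_y P(w, y) l(y)."""
    return sum(
        huffman_cost([sum(joint[(w, y, z)] for z in range(n_z)) for y in range(n_y)])
        for w in range(n_w)
    )


def length_given_wz(joint, n_w, n_y, n_z):
    """L_{Y|W,Z}: sum_{w,z} min_l sum_y P(w, y, z) l(y)."""
    return sum(
        huffman_cost([joint[(w, y, z)] for y in range(n_y)])
        for w in range(n_w) for z in range(n_z)
    )


def conditioning_gap(seed, n_w=2, n_y=5, n_z=3):
    """L_{Y|W} - L_{Y|W,Z}, which should be nonnegative."""
    joint = random_joint(n_w, n_y, n_z, random.Random(seed))
    return length_given_w(joint, n_w, n_y, n_z) - length_given_wz(joint, n_w, n_y, n_z)
```

Because $P(w) P(y|w) = P(w, y)$, the outer weighting is folded into the unnormalized weights passed to `huffman_cost`, which keeps the code short without changing the quantities compared.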

2) Proving that $L_{Y|W,Z} \ge L_{Y,Z|W} - \log|\mathcal{Z}|$: The intuition behind this is simple: given $W$, the average
optimal code length for the pair $(Y, Z)$ cannot be larger than that of coding $Z$ separately and concatenating a
codeword that describes $Y$ and is decodable when $Z$ is known. The optimal scheme for coding the pair
cannot be worse, since otherwise this scheme could be used instead. To see this mathematically, for each $y \in \mathcal{Y}$, $z \in \mathcal{Z}$,
let $l^*(y)$, $l^*(z)$ be the length functions optimized for the distributions $P(y|w, z)$ and $P(z|w)$ respectively.




Using the fact that $L_{Z|W} \le \log|\mathcal{Z}|$ we have:



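\begin{align}
L_{Y|Z,W} + \log|\mathcal{Z}| &\ge \sum_{w, z} P(w, z) \sum_{y} P(y|w, z) l^*(y) + \sum_{w} P(w) \sum_{z} P(z|w) l^*(z) \nonumber\\
&= \sum_{w} P(w) \sum_{z} P(z|w) \sum_{y} P(y|w, z) l^*(y) + \sum_{w} P(w) \sum_{z} P(z|w) l^*(z) \nonumber\\
&= \sum_{w} P(w) \sum_{z} P(z|w) \left[ \sum_{y} P(y|w, z) l^*(y) + l^*(z) \right] \tag{A.16}
\end{align}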